SparseDOSSA2: Simulate synthetic microbial abundance observations with...

SparseDOSSA2    R Documentation

Simulate synthetic microbial abundance observations with SparseDOSSA2

Description

SparseDOSSA2 generates synthetic microbial abundance observations from either a pre-trained template or user-provided fitted results from fit_SparseDOSSA2 or fitCV_SparseDOSSA2. Additional options are available for simulating associations between microbial features and metadata variables.

Usage

SparseDOSSA2(
  template = "Stool",
  n_sample = 100,
  new_features = TRUE,
  n_feature = 100,
  spike_metadata = "none",
  metadata_effect_size = 1,
  perc_feature_spiked_metadata = 0.05,
  metadata_matrix = NULL,
  median_read_depth = 50000,
  verbose = TRUE
)

Arguments

template

can be 1) a character string ("Stool", "Vaginal", or "IBD") indicating one of the pre-trained templates in SparseDOSSA2, or 2) user-provided, fitted results. In the latter case this should be an output from fit_SparseDOSSA2 or fitCV_SparseDOSSA2.

n_sample

number of samples to simulate

new_features

TRUE/FALSE indicator for whether or not new features should be simulated. If FALSE then the same set of features in template will be simulated.

n_feature

number of features to simulate. Only relevant when new_features is TRUE

spike_metadata

configuration for metadata spike-in. Must be one of two things:

  • a character string of "none", "both", "abundance", or "prevalence", indicating whether or not association with metadata will be spiked in. For the spiked-in case, it indicates whether features' abundance, prevalence, or both will be associated with metadata (also see explanations for metadata_effect_size and perc_feature_spiked_metadata)

  • a data.frame for detailed spike-in configurations. This is the more advanced approach, where detailed specifications for associations between metadata and microbial features are provided. Note: if spike_metadata is provided as a data.frame, then metadata_matrix must be provided as well (it cannot be generated automatically). In this case, spike_metadata must have exactly four columns: metadata_datum, feature_spiked, associated_property, and effect_size. Each row of the data.frame configures one specific metadata-microbe association. Specifically:

    • metadata_datum (integer) indicates the column number for the metadata variable to be associated with the microbe

    • feature_spiked (character) indicates the microbe name to be associated with the metadata variable

    • associated_property (character, either "abundance" or "prevalence"), indicating the property of the microbe to be modified. If you want the microbe to be associated with the metadata variable in both properties, include two rows in spike_metadata, one for abundance and one for prevalence

    • effect_size (numeric) indicating the strength of the association. This corresponds to log fold change in non-zero abundance for "abundance" spike-in, and log odds ratio for "prevalence" spike-in
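As a sketch of the data.frame form described above, the following builds a configuration with three associations. The feature names ("Feature1", "Feature2") are hypothetical placeholders; in practice they would presumably need to match feature names in the chosen template or fitted model. The effect sizes are arbitrary illustrative values.

```r
# Hypothetical feature names; replace with names present in your template.
spike_df <- data.frame(
  metadata_datum      = c(1L, 1L, 2L),                 # column numbers in metadata_matrix
  feature_spiked      = c("Feature1", "Feature1", "Feature2"),
  associated_property = c("abundance", "prevalence", "abundance"),
  effect_size         = c(1.5, 0.8, -1.0),             # illustrative values only
  stringsAsFactors    = FALSE
)

# Basic checks mirroring the requirements stated above
stopifnot(
  identical(colnames(spike_df),
            c("metadata_datum", "feature_spiked",
              "associated_property", "effect_size")),
  is.integer(spike_df$metadata_datum),
  all(spike_df$associated_property %in% c("abundance", "prevalence"))
)
```

Feature1 appears in two rows so that both its abundance and its prevalence are associated with metadata column 1. Because spike_metadata is a data.frame here, a metadata_matrix with at least two columns would also have to be supplied.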

metadata_effect_size

(for when spike_metadata is "abundance", "prevalence", or "both") effect size of the spiked-in associations. This is the log fold change in non-zero abundance for abundance spike-in, and the log odds ratio for prevalence spike-in
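To make the two effect-size scales concrete, the arithmetic below converts an effect size into a fold change and into a shifted prevalence. It assumes natural logarithms and a one-unit change in the metadata variable; this page does not state the log base, so treat it as an illustration rather than a statement about the package internals. The baseline prevalence of 0.30 is a made-up value.

```r
effect_size <- 1

# Abundance spike-in: log fold change -> fold change (natural log assumed)
fold_change <- exp(effect_size)

# Prevalence spike-in: add the log odds ratio to the baseline log odds
baseline_prev <- 0.30                                  # illustrative value
spiked_prev   <- plogis(qlogis(baseline_prev) + effect_size)

round(c(fold_change = fold_change, spiked_prev = spiked_prev), 3)
```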

perc_feature_spiked_metadata

(for when spike_metadata is "abundance", "prevalence", or "both") percentage of features to be associated with metadata

metadata_matrix

a user-provided metadata matrix to use for spiking in associations with features. If left at the default (NULL), two variables will be generated: one continuous, and one binary with balanced cases and controls. Note: if spike_metadata is provided as a data.frame, then the user must provide metadata_matrix too
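A user-supplied metadata matrix resembling the default described above could be assembled as follows. This is a sketch under assumptions: the column names are hypothetical, the standard normal draw for the continuous variable is a guess at a reasonable distribution, and the number of rows is assumed to need to match n_sample.

```r
set.seed(1)
n_sample <- 100

metadata_matrix <- cbind(
  continuous = rnorm(n_sample),                            # assumed distribution
  binary     = sample(rep(c(0, 1), each = n_sample / 2))   # balanced cases/controls
)

dim(metadata_matrix)
table(metadata_matrix[, "binary"])
```

Samples are in rows and metadata variables in columns, so that metadata_datum in a spike_metadata data.frame can refer to a variable by its column number.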

median_read_depth

targeted median per-sample read depth

verbose

whether detailed information should be printed

Value

a list with the following components:

simulated_data

feature by sample matrix of simulated microbial count observations

simulated_matrices

list of all simulated data matrices, including that of null (i.e. not spiked-in) absolute abundances, spiked-in absolute abundances, and normalized relative abundances

params

parameters used for simulation. These are provided in template.

spike_metadata

list of variables provided or generated for metadata spike-in. This includes spike_metadata (the original spike_metadata parameter provided by the user), metadata_matrix (the metadata, either provided by the user or internally generated), and feature_metadata_spike_df (a detailed specification of which metadata variables were used to spike in associations with which features, in which properties, and at which effect sizes). feature_metadata_spike_df is the same as spike_metadata if the latter was provided as a data.frame.
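Given the components listed above, a quick sanity check on a result might look like the sketch below. check_sim is a hypothetical helper, not part of SparseDOSSA2, and mock_sim is a stand-in list of Poisson counts used only so the snippet runs without the package; a real result from SparseDOSSA2() would be passed in its place.

```r
# Hypothetical helper: summarise the count matrix of a simulation result
check_sim <- function(sim) {
  counts <- sim$simulated_data          # features in rows, samples in columns
  depth  <- colSums(counts)             # per-sample read depth
  list(
    n_feature    = nrow(counts),
    n_sample     = ncol(counts),
    median_depth = median(depth),
    prop_zero    = mean(counts == 0)    # overall fraction of zero counts
  )
}

# Mock object for illustration only (not real SparseDOSSA2 output)
set.seed(1)
mock_sim <- list(
  simulated_data = matrix(rpois(20 * 5, lambda = 2), nrow = 20, ncol = 5)
)
str(check_sim(mock_sim))
```

On a real result, median_depth can be compared against the median_read_depth that was requested.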

Author(s)

Siyuan Ma, syma.research@gmail.com

Examples

## Using one of the pre-trained SparseDOSSA2 templates:
sim <- SparseDOSSA2(template = "Stool", n_sample = 200, new_features = FALSE)
## Using user-provided trained SparseDOSSA2 model:
data("Stool_subset")
fitted <- fit_SparseDOSSA2(data = Stool_subset)
sim <- SparseDOSSA2(template = fitted, n_sample = 200, new_features = FALSE)

biobakery/SparseDOSSA2 documentation built on Dec. 3, 2024, 10:17 p.m.