dacomp.generate_example_dataset.two_sample: Generate a simulated two sample dataset, based on data from...


View source: R/dacomp_generate_example_data.R

Description

This function generates a two-sample dataset, based on the kostic dataset (Kostic et al. 2012) from the phyloseq package (McMurdie and Holmes 2013). Simulated data is generated in a procedure similar to the one presented in Brill et al. 2019, Subsection 4.1. See additional details below.

Usage

dacomp.generate_example_dataset.two_sample(
  n_X = 30,
  n_Y = 30,
  m1 = 30,
  signal_strength_as_change_in_microbial_load = 0.1
)

Arguments

n_X

Number of samples from the first group

n_Y

Number of samples from the second group

m1

Number of differentially abundant taxa

signal_strength_as_change_in_microbial_load

A number in the range 0-0.75, indicating the fraction of the microbial load of group Y that is added due to the simulated condition. The complement of this fraction is the fraction of the microbial load of group Y that is distributed across taxa as in group X.
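As a rough illustration of this fraction, the arithmetic can be sketched with made-up numbers. This is not the package's internal code: the vectors `p_X` and `added` are hypothetical, and the mixture formula is one reading of the description above.

```r
# Hypothetical illustration; not the package's internal code.
s <- 0.1                          # signal_strength_as_change_in_microbial_load
p_X <- c(0.5, 0.3, 0.2)           # relative frequencies as in group X (made up)
added <- c(1, 0, 0)               # split of the added load across taxa (made up)

# A fraction s of the load in group Y is added by the condition;
# the remaining 1 - s is distributed across taxa as in group X.
p_Y <- (1 - s) * p_X + s * added
p_Y
sum(p_Y)
```

Here only the first taxon is treated as differentially abundant, so its relative frequency rises while the other two shrink proportionally.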

Details

Data is generated as follows.

In the first step, we generate a list of vectors of relative frequencies to sample from: only healthy subjects from the kostic colorectal dataset are selected. Samples with fewer than 500 reads are dropped. Only OTUs that appear in 2 or more subjects are retained.

In the second step, samples for group X are generated. For each sample, a vector of frequencies is chosen at random from the list generated in the first step. The observed samples are multinomial random variables with a probability vector matching the selected frequencies, and a total number of reads realized from a Poisson distribution with a mean equal to the median number of reads across the samples listed in the first step.

In the third step, samples for group Y are generated. For each sample, a vector of frequencies is chosen at random, as for group X. The frequencies of the differentially abundant taxa are increased, with the increase realized from a Poisson random variable, such that the total increase in microbial load across all differentially abundant taxa is equivalent to the signal strength specified by the user. Observed counts are sampled based on the updated frequencies.

This function requires the phyloseq package from Bioconductor.
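The second and third steps above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the package's implementation: `ref_freqs`, `median_depth`, `draw_sample`, and `da_taxa` are hypothetical names, the reference frequencies are simulated rather than taken from the kostic data, and the way the Poisson draws are turned into an increase for the differentially abundant taxa is only one possible reading of the description.

```r
set.seed(1)

# Stand-in for step 1: a made-up list of relative-frequency vectors.
# (The real function derives these from the filtered kostic data.)
n_taxa <- 20
ref_freqs <- lapply(1:5, function(i) {
  w <- rgamma(n_taxa, shape = 0.5)
  w / sum(w)
})
median_depth <- 2000  # assumed value; stands in for the median read count

# One observed sample: multinomial counts at a Poisson-distributed depth.
draw_sample <- function(freq, depth_mean) {
  depth <- rpois(1, lambda = depth_mean)
  as.vector(rmultinom(1, size = depth, prob = freq))
}

# Step 2: group X, each sample drawn from a randomly chosen frequency vector.
n_X <- 4
X <- t(sapply(seq_len(n_X), function(i) {
  draw_sample(ref_freqs[[sample(length(ref_freqs), 1)]], median_depth)
}))

# Step 3: group Y, with extra load placed on the differentially abundant taxa.
signal <- 0.1   # plays the role of signal_strength_as_change_in_microbial_load
da_taxa <- 1:3  # hypothetical indices of differentially abundant taxa
n_Y <- 4
Y <- t(sapply(seq_len(n_Y), function(i) {
  base <- ref_freqs[[sample(length(ref_freqs), 1)]]
  # Assumed scheme: Poisson weights decide how the added load is split.
  w <- rpois(length(da_taxa), lambda = 10) + 1  # +1 avoids an all-zero split
  bump <- numeric(n_taxa)
  bump[da_taxa] <- w / sum(w)
  draw_sample((1 - signal) * base + signal * bump, median_depth)
}))

dim(X)
dim(Y)
```

Both `X` and `Y` are count matrices with one row per sample and one column per taxon. Mixing `base` and `bump` with weights `1 - signal` and `signal` keeps the updated frequencies summing to one.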

Value

a list with the following entries

References

Brill, Barak, Amnon Amir, and Ruth Heller. 2019. Testing for Differential Abundance in Compositional Counts Data, with Application to Microbiome Studies. arXiv Preprint arXiv:1904.08937.

Kostic, Aleksandar D, Dirk Gevers, Chandra Sekhar Pedamallu, Monia Michaud, Fujiko Duke, Ashlee M Earl, Akinyemi I Ojesina, et al. 2012. Genomic Analysis Identifies Association of Fusobacterium with Colorectal Carcinoma. Genome Research 22 (2). Cold Spring Harbor Lab: 292–98.

McMurdie, Paul J, and Susan Holmes. 2013. Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PloS One 8 (4). Public Library of Science: e61217.

Examples

## Not run: 
library(dacomp)

set.seed(1)
data = dacomp.generate_example_dataset.two_sample(
  m1 = 100,
  n_X = 50,
  n_Y = 50,
  signal_strength_as_change_in_microbial_load = 0.1
)

## End(Not run) 

barakbri/dacomp documentation built on June 17, 2021, 11:20 p.m.