simulateMetaTranscriptome: Calculate reads for one genome, for all samples


View source: R/read_distribution_genes.R

Description

simulateMetaTranscriptome simulates a gene count matrix for an entire metatranscriptome.

Usage

simulateMetaTranscriptome(
  genomeFileDir,
  genomeReadMatrix,
  modelMatrix = NULL,
  DE = F,
  foldChanges = NULL,
  foldProbs = NULL,
  nSamples = NULL,
  nControls = NULL,
  seed = 42
)

Arguments

genomeFileDir

Character string indicating the location of the fasta files for all genomes to be included in the metatranscriptome simulation. The basenames of these fasta files must match the row names of the genomeReadMatrix composition matrix. See Details.

genomeReadMatrix

Microbial composition matrix containing the number of reads per genome and per sample. It can be obtained with the function compositionGenomesMetaT.

modelMatrix

A composition matrix of gene expression, in which rows represent genes and columns represent replicates. Users can provide their own; otherwise, the matrix from the pasilla dataset is used. A zero-inflated negative binomial distribution is fitted to this matrix, and the fitted parameters are used to randomly assign expression levels to the genes of each microbial genome.

DE

Logical, whether or not to simulate differential expression between cases and controls (defaults to FALSE).

foldChanges

Numeric vector containing the fold changes to simulate. It should include the value 1, for genes that are not differentially expressed. Required if DE is set to TRUE.

foldProbs

Numeric vector containing the probabilities for each of the fold changes specified in the parameter foldChanges. Required if DE is set to TRUE. See Examples.

nSamples

An integer; must be specified if DE is set to TRUE. Number of cases in the simulated experiment. nSamples + nControls must be equal to the number of columns in the composition matrix genomeReadMatrix.

nControls

An integer; must be specified if DE is set to TRUE. Number of controls in the simulated experiment. nSamples + nControls must be equal to the number of columns in the composition matrix genomeReadMatrix.

seed

An integer, sets the random seed for the read distribution.
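The constraints on the differential expression arguments can be checked before starting a simulation. The following is a minimal sketch of such a check; checkDEArguments is a hypothetical helper, not a metaester function, and the requirement that foldChanges and foldProbs have the same length is an assumption read from the description of foldProbs.

```r
# Hypothetical helper (not part of metaester): checks the DE-related arguments
# against the constraints stated in the argument descriptions.
checkDEArguments <- function(genomeReadMatrix, foldChanges, foldProbs,
                             nSamples, nControls) {
  stopifnot(
    is.numeric(foldChanges), is.numeric(foldProbs),
    length(foldChanges) == length(foldProbs),       # assumed: one entry per fold change
    1 %in% foldChanges,                             # fold change 1 for non-DE genes
    nSamples + nControls == ncol(genomeReadMatrix)  # cases + controls = samples
  )
  invisible(TRUE)
}

# Toy composition matrix: 2 genomes (rows) x 10 samples (columns)
toyMatrix <- matrix(1000, nrow = 2, ncol = 10,
                    dimnames = list(c("genomeA", "genomeB"), NULL))

ok <- checkDEArguments(toyMatrix, foldChanges = c(0.5, 1, 2),
                       foldProbs = c(10, 80, 10), nSamples = 5, nControls = 5)
```

A call that violates any of the conditions stops with an error naming the failed condition, which is usually easier to read than a failure deep inside the simulation.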

Details

This function iterates over all the genomes present in the composition matrix, simulates the gene expression matrix for each of them, and combines the results into a single matrix. Valid extensions for the fasta files located in genomeFileDir are: *.fa, *.fasta, *.fna, *.genes.fa, *.genes.fasta, *.genes.fna
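To see which genomes in a composition matrix lack a matching fasta file, the file names can be reduced to basenames and compared with the row names. The sketch below is illustrative only: stripFastaExtension is a hypothetical helper, and treating ".genes" as part of the extension (rather than part of the basename) is an assumption about how metaester matches names.

```r
# Extensions listed in Details, longest first so that ".genes.fa" is tried
# before the shorter ".fa"
validExtensions <- c(".genes.fasta", ".genes.fna", ".genes.fa",
                     ".fasta", ".fna", ".fa")

# Hypothetical helper: strip the first matching valid extension from each file
# name; files without a valid extension yield NA
stripFastaExtension <- function(fileNames) {
  vapply(basename(fileNames), function(f) {
    for (ext in validExtensions) {
      if (endsWith(f, ext)) {
        return(substr(f, 1, nchar(f) - nchar(ext)))
      }
    }
    NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Made-up file listing, standing in for list.files(genomeFileDir)
fastaFiles <- c("fprausnitzii.genes.fna", "efaecalis.fa", "notes.txt")
genomeNames <- stripFastaExtension(fastaFiles)

# Genomes named in the composition matrix that have no fasta file
expectedGenomes <- c("fprausnitzii", "efaecalis", "bobeum")
missingGenomes <- setdiff(expectedGenomes, genomeNames)
```

With a real composition matrix, expectedGenomes would be rownames(genomeReadMatrix).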

Value

A list containing the following elements:
- simulationData: a data.frame with the read counts for each gene and each sample. Each row represents a gene and each column a sample. If there is differential expression, column names indicate whether each sample is a case or a control
- numSamples: if DE is set to TRUE, the number of cases specified, otherwise NULL
- numControls: if DE is set to TRUE, the number of controls, otherwise NULL
- DEgenes: if DE is set to TRUE, a two-column data.frame, the first column indicating gene names and the second column the fold change applied to each gene
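The returned list can be explored with ordinary data.frame operations. The sketch below uses a mock object with the documented element names so that it runs without the package; all values, gene names, and column names in it are made up for illustration, and the exact column naming used by metaester is not specified here.

```r
# Mock object with the documented structure; every value is invented
mockResult <- list(
  simulationData = data.frame(
    case_1 = c(10, 0, 7), case_2 = c(12, 1, 9),
    control_1 = c(5, 0, 8), control_2 = c(6, 2, 7),
    row.names = c("geneA", "geneB", "geneC")
  ),
  numSamples = 2,
  numControls = 2,
  DEgenes = data.frame(gene = c("geneA", "geneB", "geneC"),
                       foldChange = c(2, 1, 1))
)

# Total simulated reads per sample (one value per column)
readsPerSample <- colSums(mockResult$simulationData)

# Genes whose applied fold change differs from 1; the second column is
# addressed by position, since only the column order is documented
deGenes <- mockResult$DEgenes[mockResult$DEgenes[[2]] != 1, ]
```

The same two operations should apply to the output of simulateMetaTranscriptome when DE is set to TRUE; with DE set to FALSE, DEgenes is not described and should not be assumed to exist.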

References

- Huber W, Reyes A (2018). pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al. R package version 1.8.0.
- Frazee AC, Jaffe AE, Kirchner R, Leek JT (2018). polyester: Simulate RNA-seq reads. R package version 1.16.0.

Examples

# First, define a list of genomes to simulate. The names of these genomes need to match
# the names of the fasta files (without the extension). The genomes used are:
# - F. prausnitzii
# - R. intestinalis
# - L. johnsonii
# - E. faecalis
# - B. obeum
genomesToSimulate <- c("fprausnitzii", "rintestinalis", "ljohnsonii", "efaecalis",
                       "bobeum")

# Then, obtain the empirical composition matrix for these 5 species
compMatrix <- compositionGenomesMetaT(composition="empirical", empiricalSeed=1,
                                   genomes=genomesToSimulate, nReads=500000,
                                   nReplicates=10)


# Obtain the gene expression matrix for the full community (metatranscriptome)
# In this case, there is no differential expression in any of the bacteria.
# No modelMatrix is provided, so the one from the pasilla dataset will be used.
# For this, first indicate the location of the fasta files
genomesFolder <- system.file("extdata", package = "metaester", mustWork = TRUE)
metatranscriptome <- simulateMetaTranscriptome(genomeFileDir=genomesFolder,
                                               genomeReadMatrix=compMatrix)

# Obtain the gene expression matrix for the full community (metatranscriptome)
# incorporating differential expression: 10% of the genes (in each bacterium) have
# a 2-fold overexpression and 10% have a 0.5-fold depletion.
# No modelMatrix is provided, so the one from the pasilla dataset will be used.
# As there are 10 samples in the count matrix, we assign 5 cases and 5 controls.
metatranscriptome <- simulateMetaTranscriptome(genomeFileDir=genomesFolder,
                                               genomeReadMatrix=compMatrix, DE=TRUE,
                                               foldChanges=c(0.5,1,2),
                                               foldProbs=c(10,80,10),
                                               nSamples=5, nControls=5)

vllorens/metaester documentation built on April 26, 2020, 6:55 p.m.