simulateFastaReads: Generate fasta files for a metatranscriptome gene expression...
In vllorens/metaester: Simulate metatranscriptomics datasets

Description Usage Arguments Details Value References Examples

View source: R/read_distribution_genes.R

simulateFastaReads generates the fasta files for a given metatranscriptome gene expression matrix, calculated with simulateMetaTranscriptome. The fasta files are written to a directory. This is a wrapper to the simulate_experiment_countmat function from the polyester package

simulateFastaReads(
  genomeFileDir,
  simulatedDataSet,
  outdir = ".",
  paired = T,
  seed = 42,
  distr = "empirical",
  error_model = "illumina5",
  bias = "rnaf",
  strand_specific = T
)

`genomeFileDir`	Character string indicating the location of the fasta files for all genomes to be included in the metatranscriptome simulation. The basenames of these fasta files must match the rownames of the genomeReadMatrix composition matrix. See details
`simulatedDataSet`	Metatranscriptome gene expression matrix containing the read counts per gene and per sample. Can be obtained using the function `simulateMetaTranscriptome`
`outdir`	Path to the directory where the fasta files with the simulated reads should be written. Created if not exists.
`paired`	Logical, whether to generate paired-end fasta reads (defaults to TRUE)
`seed`	An integer, sets the random seed for the fasta generation
`distr`	Parameter to be passed to the `simulate_experiment_countmat` function. Controls the distribution of fragment length for paired-end reads. Defaults to "empirical". Check documentation in the polyester package.
`error_model`	Parameter to be passed to the `simulate_experiment_countmat` function. Used to simulate error rates from the sequence. Defaults to "illumina5". Check documentation in the polyester package.
`bias`	Parameter to be passed to the `simulate_experiment_countmat` function. Controls the distribution of reads within the gene/transcript, which may be different depending on the fragmentation step of the library preparation protocol (leading to increased coverage at the 5' or 3' of the transcript). Defaults to "rnaf". Check documentation in the polyester package.
`strand_specific`	Logical. Parameter to be passed to the `simulate_experiment_countmat` function. Should the reads be strand-specific? Check documentation in the polyester package.

This function is a wrapper to the simulate_experiment_countmat function from the polyester package. Reads can be simulated from a multifasta file containing one sequence for each gene. Functionality to work with full genomes + GTF annotation files is not developed. Meanwhile, there are several (external) tools that allow to convert a genome fasta + a GTF file into a transcript fasta, such as gffread from Cufflinks.

Writes the following files into the specified output directory: - Fasta files (*.fasta), one for each sample, containing the reads per gene specified in the metatranscriptomics gene count matrix (simulatedDataSet object) - Gene count matrix (countMatrix.txt): matrix with the counts per gene and per sample, useful to compare the results of mapping/differential expression tools to - Differentially expressed genes list (DEgenes.txt): if DE is set to TRUE, a two-column matrix, the first column indicating gene names and the second column the fold change applied to each gene. Otherwise the file will be empty.

- Alyssa C. Frazee, Andrew E. Jaffe, Rory Kirchner and Jeffrey T. Leek (2018). polyester: Simulate RNA-seq reads. R package version 1.16.0.

# First, define a list of genomes to simulate. The names of these genomes need to match
# the names of the fasta files (without the extension). The genomes used are:
# - F. prausnitzii
# - R. intestinalis
# - L. johnsonii
# - E. faecalis
# - B. obeum
genomesToSimulate <- c("fprausnitzii", "rintestinalis", "ljohnsonii", "efaecalis",
                       "bobeum")

# Then, obtain the empirical composition matrix for this 5 species
compMatrix <- compositionGenomesMetaT(composition="empirical", empiricalSeed=1,
                                   genomes=genomesToSimulate, nReads=500000,
                                   nReplicates=10)

# Obtain the gene expression matrix for the full community (metatranscriptome)
# incorporating differential expression: 10% genes (in each bacterium) have a 2-fold
# overexpression and 10% have a 0.5-fold depletion.
# No composition matrix is provided, so the one from the pasilla dataset will be used.
# As there are 10 samples in the count matrix, we assign 5 cases and 5 controls.
genomesFolder = system.file("extdata", package = "metaester", mustWork = TRUE)
metatranscriptome <- simulateMetaTranscriptome(genomeFileDir=genomesFolder,
                                               genomeReadMatrix=compMatrix, DE=TRUE,
                                               foldChanges=c(0.5,1,2),
                                               foldProbs=c(10,80,10),
                                               nSamples=5, nControls=5)

# Finally, generate the fasta files and write them to the output directory
simulateFastaReads(genomeFileDir=genomesFolder, simulatedDataSet=metatranscriptome,
                   outdir=".")