simulateFastaReads: Generate fasta files for a metatranscriptome gene expression...

Description Usage Arguments Details Value References Examples

View source: R/read_distribution_genes.R

Description

simulateFastaReads generates the fasta files for a given metatranscriptome gene expression matrix, calculated with simulateMetaTranscriptome. The fasta files are written to a directory. This is a wrapper to the simulate_experiment_countmat function from the polyester package

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
simulateFastaReads(
  genomeFileDir,
  simulatedDataSet,
  outdir = ".",
  paired = T,
  seed = 42,
  distr = "empirical",
  error_model = "illumina5",
  bias = "rnaf",
  strand_specific = T
)

Arguments

genomeFileDir

Character string indicating the location of the fasta files for all genomes to be included in the metatranscriptome simulation. The basenames of these fasta files must match the rownames of the genomeReadMatrix composition matrix. See details

simulatedDataSet

Metatranscriptome gene expression matrix containing the read counts per gene and per sample. Can be obtained using the function simulateMetaTranscriptome

outdir

Path to the directory where the fasta files with the simulated reads should be written. Created if not exists.

paired

Logical, whether to generate paired-end fasta reads (defaults to TRUE)

seed

An integer, sets the random seed for the fasta generation

distr

Parameter to be passed to the simulate_experiment_countmat function. Controls the distribution of fragment length for paired-end reads. Defaults to "empirical". Check documentation in the polyester package.

error_model

Parameter to be passed to the simulate_experiment_countmat function. Used to simulate error rates from the sequence. Defaults to "illumina5". Check documentation in the polyester package.

bias

Parameter to be passed to the simulate_experiment_countmat function. Controls the distribution of reads within the gene/transcript, which may be different depending on the fragmentation step of the library preparation protocol (leading to increased coverage at the 5' or 3' of the transcript). Defaults to "rnaf". Check documentation in the polyester package.

strand_specific

Logical. Parameter to be passed to the simulate_experiment_countmat function. Should the reads be strand-specific? Check documentation in the polyester package.

Details

This function is a wrapper to the simulate_experiment_countmat function from the polyester package. Reads can be simulated from a multifasta file containing one sequence for each gene. Functionality to work with full genomes + GTF annotation files is not developed. Meanwhile, there are several (external) tools that allow to convert a genome fasta + a GTF file into a transcript fasta, such as gffread from Cufflinks.

Value

Writes the following files into the specified output directory: - Fasta files (*.fasta), one for each sample, containing the reads per gene specified in the metatranscriptomics gene count matrix (simulatedDataSet object) - Gene count matrix (countMatrix.txt): matrix with the counts per gene and per sample, useful to compare the results of mapping/differential expression tools to - Differentially expressed genes list (DEgenes.txt): if DE is set to TRUE, a two-column matrix, the first column indicating gene names and the second column the fold change applied to each gene. Otherwise the file will be empty.

References

- Alyssa C. Frazee, Andrew E. Jaffe, Rory Kirchner and Jeffrey T. Leek (2018). polyester: Simulate RNA-seq reads. R package version 1.16.0.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# First, define a list of genomes to simulate. The names of these genomes need to match
# the names of the fasta files (without the extension). The genomes used are:
# - F. prausnitzii
# - R. intestinalis
# - L. johnsonii
# - E. faecalis
# - B. obeum
genomesToSimulate <- c("fprausnitzii", "rintestinalis", "ljohnsonii", "efaecalis",
                       "bobeum")

# Then, obtain the empirical composition matrix for this 5 species
compMatrix <- compositionGenomesMetaT(composition="empirical", empiricalSeed=1,
                                   genomes=genomesToSimulate, nReads=500000,
                                   nReplicates=10)

# Obtain the gene expression matrix for the full community (metatranscriptome)
# incorporating differential expression: 10% genes (in each bacterium) have a 2-fold
# overexpression and 10% have a 0.5-fold depletion.
# No composition matrix is provided, so the one from the pasilla dataset will be used.
# As there are 10 samples in the count matrix, we assign 5 cases and 5 controls.
genomesFolder = system.file("extdata", package = "metaester", mustWork = TRUE)
metatranscriptome <- simulateMetaTranscriptome(genomeFileDir=genomesFolder,
                                               genomeReadMatrix=compMatrix, DE=TRUE,
                                               foldChanges=c(0.5,1,2),
                                               foldProbs=c(10,80,10),
                                               nSamples=5, nControls=5)

# Finally, generate the fasta files and write them to the output directory
simulateFastaReads(genomeFileDir=genomesFolder, simulatedDataSet=metatranscriptome,
                   outdir=".")

vllorens/metaester documentation built on April 26, 2020, 6:55 p.m.