knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Metaester is an R package to simulate metatranscriptomics datasets.

Metaester is built from Polyester, an R package to simulate RNA-seq experiments, hence its name. Many of the functions in Metaester are wrappers to Polyester functions. However, we have not exploited all the possible functionalities of polyester yet.

Metaester allows to simulate an entire microbial community, by first distributing the RNA-seq reads among the different genomes present in the simulated sample and then distributing the RNA-seq reads among the different genes, within each simulated genome.

Installation

Metaester is built upon the devel version of Polyester (as of July 2018, not all changes in the devel version were pushed to the bioC version). Therefore, to use Metaester, we strongly recommend to install the devel version of Polyester:

devtools::install_github("alyssafrazee/polyester")

It is also required to install the Pasilla dataset manually:

source("https://bioconductor.org/biocLite.R")
biocLite("pasilla")

To install Metaester, run the following code:

devtools::install_github("vllorens/metaester")

The metaester workflow

Metaester is built with the aim of simulating metatranscriptomics datasets as simply as possible. Only 4 steps are required to simulate an experiment:

Metaester examples

library(metaester)

This first set of examples uses an empirical distribution for the community composition, modeled as a negative binomial distribution.

Example 1: 10 replicates, NO differential composition, NO differential expression

Step 0: define the parameters of the experiment

In this case, we will simulate 10 samples with the same microbial composition and gene expression profiles, so the only differences among them will be due to the process of sampling. Those would be analogous to technical replicates in sequencing.

# load the genomes of 5 gut bacteria and list them
path_to_fastas <- system.file("extdata", package = "metaester", mustWork = TRUE) 
genomeList <- list.files(path_to_fastas, pattern = ".genes.fna")                 
genomeList <- gsub(genomeList, pattern = ".genes.fna", replacement = "")             

genomeList

# set random seed for the genome composition
random_seed <- 1      

# set number of replicates and reads to simulate.
# as there's no differential expression among samples no need to specify number of cases and controls
samples <- 10          
reads_perSample <- 10000

# specify output directory
output_simulation <- "output_data"

Step 1: Generate genome read distribution.

Here, we use the function compositionGenomesMetaT(). We will distribute the reads among the 5 genomes using an empirical distribution (i.e. a negative binomial, derived from real data).

genomeComp <- compositionGenomesMetaT(composition = "empirical", empiricalSeed = random_seed, 
                                      nReplicates = samples, genomes = genomeList, 
                                      nReads = reads_perSample)

# print the microbial composition table to see that differences in abundances are only due to sampling
genomeComp

Step 2: Generate gene read distribution.

Here, we use the function simulateMetaTranscriptome(). This function takes the reads assigned to each genome and the list of genes within this genome (extracted from the FASTA file), and then distributes the reads among the different genes of that genome. This function does this for the 5 genomes in our list and for the 10 samples and aggregates the output, resulting in a single gene count matrix, in which each row represents a single gene from any of this 5 bacteria and each column represents one of the 10 samples.

metaT_experiment=simulateMetaTranscriptome(genomeFileDir = path_to_fastas, genomeReadMatrix = genomeComp)

Step 3: Generate the FASTA files.

Here, we use the function simulateFastaReads(). This function takes the output from the previous one, as well as the gene FASTAS from each genome. It writes the simulated reads fasta files to the output directory (2 fasta files per sample in paired-end mode or 1 in single-end mode). It also writes the gene count matrix in the same directory, and a list of differentially expressed genes.

## Step 4: Generate the fasta files (function simulateFastaReads)
simulateFastaReads(genomeFileDir = path_to_fastas, outdir = output_simulation, 
                   simulatedDataSet = metaT_experiment)

Example 2: 10 replicates, differential composition (between cases and controls), NO differential expression

Step 0: define the parameters of the experiment

In this case, we will simulate 10 samples in two groups of 5, in which each group has a different microbial composition. Here, bacteria change their relative abundances from one group to another, but the functions they perform are the same, as there is no differential expression.

# load the genomes of 5 gut bacteria and list them
path_to_fastas <- system.file("extdata", package = "metaester", mustWork = TRUE) 
genomeList <- list.files(path_to_fastas, pattern = ".genes.fna")                 
genomeList <- gsub(genomeList, pattern = ".genes.fna", replacement = "")             

genomeList

# set random seed for the genome composition (for 10 samples, we generate the 2 groups)
random_seed <- rep(1:2, each=5)     

# set number of replicates and reads to simulate.
# as there's no differential expression among samples no need to specify number of cases and controls
samples <- 10          
reads_perSample <- 10000

# specify output directory
output_simulation <- "output_data"

Step 1: Generate genome read distribution.

Here, we use the function compositionGenomesMetaT(). We will distribute the reads among the 5 genomes using an empirical distribution (i.e. a negative binomial, derived from real data).

genomeComp <- compositionGenomesMetaT(composition = "empirical", empiricalSeed = random_seed, 
                                      nReplicates = samples, genomes = genomeList, 
                                      nReads = reads_perSample)

# print the microbial composition table to see the differences between the two groups
genomeComp

Step 2: Generate gene read distribution.

Here, we use the function simulateMetaTranscriptome(). This function takes the reads assigned to each genome and the list of genes within this genome (extracted from the FASTA file), and then distributes the reads among the different genes of that genome. This function does this for the 5 genomes in our list, and for the 10 samples, and aggregates the output, resulting in a single gene count matrix, in which each row represents a single gene from any of this 5 bacteria and each column represents one of the 10 samples.

metaT_experiment=simulateMetaTranscriptome(genomeFileDir = path_to_fastas, genomeReadMatrix = genomeComp)

Step 3: Generate the FASTA files.

Here, we use the function simulateFastaReads(). This function takes the output from the previous one, as well as the gene FASTAS from each genome. It writes the simulated reads fasta files to the output directory (2 fasta files per sample in paired-end mode or 1 in single-end mode). It also writes the gene count matrix in the same directory, and a list of differentially expressed genes.

simulateFastaReads(genomeFileDir = path_to_fastas, outdir = output_simulation,
                   simulatedDataSet = metaT_experiment)

Example 3: 10 replicates, differential composition across all samples, NO differential expression

Step 0: define the parameters of the experiment

In this case, we will simulate 10 samples, but each sample has a different bacterial composition. This would be equivalent to sequencing samples from different individuals. Here, bacteria change their relative abundances from one sample to another, but the functions they perform are the same, as there is no differential expression.

# load the genomes of 5 gut bacteria and list them
path_to_fastas <- system.file("extdata", package = "metaester", mustWork = TRUE) 
genomeList <- list.files(path_to_fastas, pattern = ".genes.fna")                 
genomeList <- gsub(genomeList, pattern = ".genes.fna", replacement = "")             

genomeList

# set random seed for the genome composition (for 10 samples, a different seed per sample)
random_seed <- 1:10     

# set number of samples and reads to simulate.
# as there's no differential expression among samples no need to specify number of cases and controls
samples <- 10          
reads_perSample <- 10000

# specify output directory
output_simulation <- "output_data"

Step 1: Generate genome read distribution.

Here, we use the function compositionGenomesMetaT(). We will distribute the reads among the 5 genomes using an empirical distribution (i.e. a negative binomial, derived from real data).

genomeComp <- compositionGenomesMetaT(composition = "empirical", empiricalSeed = random_seed, 
                                      nReplicates = samples, genomes = genomeList, 
                                      nReads = reads_perSample)

Step 2: Generate gene read distribution.

Here, we use the function simulateMetaTranscriptome(). This function takes the reads assigned to each genome and the list of genes within this genome (extracted from the FASTA file), and then distributes the reads among the different genes of that genome. This function does this for the 5 genomes in our list, and for the 10 samples, and aggregates the output, resulting in a single gene count matrix, in which each row represents a single gene from any of this 5 bacteria and each column represents one of the 10 samples.

metaT_experiment=simulateMetaTranscriptome(genomeFileDir = path_to_fastas, genomeReadMatrix = genomeComp)

Step 3: Generate the FASTA files.

Here, we use the function simulateFastaReads(). This function takes the output from the previous one, as well as the gene FASTAS from each genome. It writes the simulated reads fasta files to the output directory (2 fasta files per sample in paired-end mode or 1 in single-end mode). It also writes the gene count matrix in the same directory, and a list of differentially expressed genes.

simulateFastaReads(genomeFileDir = path_to_fastas, outdir = output_simulation,
                   simulatedDataSet = metaT_experiment)

Example 4: 10 replicates, NO differential composition, 10% of genes are differentially expressed

Step 0: define the parameters of the experiment

In this case, we will simulate 10 samples with the same bacterial composition. However, we will divide them in between 5 cases and 5 controls. 10% of the genes of each bacterium will be differentially expressed between the two conditions.

# load the genomes of 5 gut bacteria and list them
path_to_fastas <- system.file("extdata", package = "metaester", mustWork = TRUE) 
genomeList <- list.files(path_to_fastas, pattern = ".genes.fna")                 
genomeList <- gsub(genomeList, pattern = ".genes.fna", replacement = "")             

genomeList

# set random seed for the genome composition (for 10 samples, a different seed per sample)
random_seed <- 1

# set number of samples and reads to simulate.
# in this case there is differential expression, so we need to specify the number of cases and controls
samples <- 10         
nCases <- 5
nControls <- 5
reads_perSample <- 10000

# differentially expresed genes will have any of these fold changes
# X > 1 overexpression
# X < 1 depletion
# X = 1 no changes (it is important to include that some genes will have no changes)
fold_changes=c(0.25,0.5,1,2,4) 

# after specifying fold changes, it is required to specify the probability of each of them in order
# probabilities don't need to add up to 100, but it makes it easier to interpret. In this case: 
# 2.5% probability of 0.25-fold
# 2.5% probability of 0.5-fold
# 90% probability of 1-fold (no changes)
# 2.5% probability of 2-fold
# 2.5% probability of 4-fold
fold_probs=c(2.5,2.5,90,2.5,2.5) 

output_simulation <- "output_data"

Step 1: Generate genome read distribution.

Here, we use the function compositionGenomesMetaT(). We will distribute the reads among the 5 genomes using an empirical distribution (i.e. a negative binomial, derived from real data).

genomeComp <- compositionGenomesMetaT(composition = "empirical", empiricalSeed = random_seed, 
                                      nReplicates = samples, genomes = genomeList, 
                                      nReads = reads_perSample)

Step 2: Generate gene read distribution.

Here, we use the function simulateMetaTranscriptome(). This function takes the reads assigned to each genome and the list of genes within this genome (extracted from the FASTA file), and then distributes the reads among the different genes of that genome. This function does this for the 5 genomes in our list, and for the 10 samples, and aggregates the output, resulting in a single gene count matrix, in which each row represents a single gene from any of this 5 bacteria and each column represents one of the 10 samples. This function also simulates differential expression. In this case, after generating the gene count matrix, it will assign any of the fold_changes randomly (according to the specified fold_probs) to each of the genes, and it will multiply the corresponding case columns for a gene by its assigned fold change. Therefore more arguments need to be specified in this case.

metaT_experiment=simulateMetaTranscriptome(genomeFileDir = path_to_fastas, genomeReadMatrix = genomeComp, 
                                           DE = T, foldChanges = fold_changes, foldProbs = fold_probs, 
                                           nSamples = nCases,  nControls = nControls)

Step 3: Generate the FASTA files.

Here, we use the function simulateFastaReads(). This function takes the output from the previous one, as well as the gene FASTAS from each genome. It writes the simulated reads fasta files to the output directory (2 fasta files per sample in paired-end mode or 1 in single-end mode). It also writes the gene count matrix in the same directory, and a list of differentially expressed genes.

## Step 4: Generate the fasta files (function simulateFastaReads)
simulateFastaReads(genomeFileDir = path_to_fastas, outdir = output_simulation, 
                   simulatedDataSet = metaT_experiment)

Similar code can be easily written to generate simulated datasets with different microbial composition AND differential expression, by combining examples 2/3 with example 4.

Session info:

sessionInfo()


vllorens/metaester documentation built on April 26, 2020, 6:55 p.m.