There are two main simulators to generate RNA-Seq data: the R package polyester
and
the stand-alone RSEM package, which has the rsem-simulate-reads
program.
The two main pieces of information necessary to run a simulation are (A) a set of sequences and (B) information about how many reads to simulate from each group. Current pipelines have set up protocols for estimating the reads per feature from real data and then simulating differential expression between groups.
The problem arises where the simulation assumes that RNA-Seq is considered count data rather than compositional data. In a typical RNA-Seq experiment, the number of fragments sequenced in each sample (i.e. the "depth of sequencing") is kept as consistent as possible across all samples. Yet, depending on the combination simulated fold changes and initial count estimates, there is a virtual guarantee that there will be a systematic bias in the total expected reads per sample between the two groups before the simulation occurs.
Take for example the independent effect simulation performed in the original sleuth paper [@Pimentel2017]. In this experiment, they simulated 20% (~13K) of filtered transcripts as differentially expressed, with an equal probability of being up- or down-regulated. When you examine the total counts per sample in Supplementary Table 2, one can see that there is a systematic bias present between conditions, where the "B" condition has ~15-20% more counts on average than the "A" condition. This results in the library depth being perfectly confounded with the experimental condition.
A second example of a more subtle bias can be observed in a recent pipeline paper [@Love2018]. In this experiment, they simulated ~20% (~9K) of expressed transcripts as differentially expressed (DTE), with some also counting for differential gene expression (DGE) and differential transcript usage (DTU). Carefully reviewing the processed data (available for download at https://doi.org/10.5281/zenodo.1410443, contained in the 'mikelove-swimdown-216a1dd/simulate/data/simulate.rda' RData file), there are ~12% more TPMs (~6% more counts) in the second condition versus the first.
In both of these simulations, a moderate percentage of expressed transcripts are differentially expressed. However, there are a few well-documented examples of extreme biological change: (A) comparing cancer cells to control cells with many up-regulated transcripts ([@Loven2012]), and (B) comparing nitrogen-starved yeast cells to cells grown in control media with many down-regulated transcripts ([@Marguerat2012]; a re-analysis clearly demonstrating the problem using standard analysis on this data is found at [@Lovell2015]). If one attempts to simulate these or other scenarios, the systematic bias would yield unrealistic systematic differences in library depths between groups.
The reason for the observed systematic biases is due to the data being treated as count data rather than
compositional data. The absSimSeq
package seeks to solve this problem by starting from copy numbers per cell,
simulating fold changes on those copy numbers, and then applying compositional principles to yield TPMs and
expected fragments per transcript. It then uses polyester
to do the underlying simulation.
This protocol allows the user to simulate any combination of expression values and fold changes, while still
being able to simulate equal library sizes across all samples.
This vignette will go through the pipeline for running a simulation using absSimSeq
. For more details about
compositional data analysis, please see [@Quinn2018] and our forthcoming preprint.
The current stable release for the polyester
package has three important limitations:
1. It does not allow for parallelization to take full advantage of machine resources
2. Its algorithm for introducing error is very slow, taking several hours to generate one sample
3. It does not shuffle fragments before writing to file, preventing the use of most alignment tools,
which expect a random order of fragments.
These limitations have been addressed in a forked version of polyester which is available using the following installation code:
devtools::install_github('warrenmcg/polyester')
This is not required to use absSimSeq, but will simplify and speed up the simulation protocol.
Note: as of this writing, this forked version has not been incorporated into the stable release, and is provided without guarantee, so use your own discretion when using it.
yeastStarvationData
To illustrate how to use absSimSeq
, we will use data from Marguerat et al. [@Marguerat2012] in fission
yeast (Schizosaccharomyces pombe). The transcriptome and kallisto results are in the
yeastStarvationData
package. Load it up using the
following code:
library(yeastStarvationData) fasta_file <- system.file("extdata", "S.pombe.all.fa.gz", package = "yeastStarvationData") data(yeastS2C) data(yeastAnnos) kal_dirs <- list.dirs(system.file("extdata", package = "yeastStarvationData"))[-1] yeastS2C$path <- kal_dirs yeast_so <- sleuth::sleuth_prep(yeastS2C, ~condition, target_mapping = yeastAnnos)
To launch an absSimSeq simulation, two main inputs are required:
A. either an RDS file storing a sleuth object (saved using sleuth_save
) or
the sleuth object itself
B. a FASTA sequence with the desired features from which to simulate
All of the target_ids from the sleuth_file must match headers from FASTA file (with exceptions for sequences downloaded with "|" or " " used as separators for metadata).
To run an absSimSeq, run the following command:
sim_res <- run_abs_simulation(yeast_so, fasta_file = fasta_file)
Coming soon!
mamabear
Coming soon!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.