simulate_experiment: simulate RNA-seq experiment using negative binomial model
In polyester: Simulate RNA-seq reads

Description Usage Arguments Details Value References Examples

create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups

1
2
3

simulate_experiment(fasta = NULL, gtf = NULL, seqpath = NULL,
  outdir = ".", num_reps = c(10, 10), reads_per_transcript = 300,
  size = NULL, fold_changes, paired = TRUE, reportCoverage = FALSE, ...)

`fasta`	path to FASTA file containing transcripts from which to simulate reads. See details.
`gtf`	path to GTF file containing transcript structures from which reads should be simulated. See details.
`seqpath`	path to folder containing one FASTA file (`.fa` extension) for each chromosome in `gtf`. See details.
`outdir`	character, path to folder where simulated reads should be written, with no slash at the end. By default, reads are written to current working directory.
`num_reps`	How many biological replicates should be in each group? The length `num_reps` determines how many groups are in the experiment. For example, `num_reps = c(5,6,5)` specifies a 3-group experiment with 5 samples in group 1, 6 samples in group 2, and 5 samples in group 3. Defaults to a 2-group experiment with 10 reps per group (i.e., `c(10,10)`).
`reads_per_transcript`	baseline mean number of reads to simulate from each transcript. Can be an integer, in which case this many reads are simulated from each transcript, or an integer vector whose length matches the number of transcripts in `fasta`. Default 300. You can also leave `reads_per_transcript` empty and set `meanmodel=TRUE` to draw baseline mean numbers from a model based on transcript length.
`size`	the negative binomial `size` parameter (see `NegBinomial`) for the number of reads drawn per transcript. It can be a matrix (where the user can specify the size parameter per transcript, per group), a vector (where the user can specify the size per transcript, perhaps relating to reads_per_transcript), or a single number, specifying the size for all transcripts and groups. If left NULL, defaults to `reads_per_transcript * fold_changes / 3`. Negative binomial variance is mean + mean^2 / size.
`fold_changes`	Matrix specifying multiplicative fold changes between groups. There is no default, so you must provide this argument. In real data sets, lowly-expressed transcripts often show high fold changes between groups, so this can be kept in mind when setting `fold_changes` and `reads_per_transcript`. This argument must have the same number of columns as there are groups as specified by `num_reps`, and must have the same number of rows as there are transcripts in `fasta`. A fold change of X in matrix entry i,j means that for replicate j, the baseline mean number of reads (reads_per_transcript[i]) will be multiplied by X. Note that the multiplication happens before the negative binomial value (for the number of reads that actually will be drawn from transcript i, for replicate j) is drawn. This argument is ignored if `length(num_reps)` is 1 (meaning you only have 1 group in your simulation).
`paired`	If `TRUE`, paired-end reads are simulated; else single-end reads are simulated. Default `TRUE`
`reportCoverage`	whether to write out coverage information to `sample_coverages.rda` file in the `outdir`. defaults to `FALSE`
`...`	any of several other arguments that can be used to add nuance to the simulation. See details.

Reads can either be simulated from a FASTA file of transcripts (provided with the fasta argument) or from a GTF file plus DNA sequences (provided with the gtf and seqpath arguments). Simulating from a GTF file and DNA sequences may be a bit slower: it took about 6 minutes to parse the GTF/sequence files for chromosomes 1-22, X, and Y in hg19.

Several optional parameters can be passed to this function to adjust the simulation. The options are:

readlen: read length. Default 100.
lib_sizes: Library size factors for the biological replicates. lib_sizes should have length equal to the total number of replicates in the experiment, i.e., sum(num_reps). For each replicate, once the number of reads to simulate from each transcript for that replicate is known, all read numbers across all transcripts from that replicate are multiplied by the corresponding entry in lib_sizes.
distr One of 'normal', 'empirical', or 'custom', which specifies the distribution from which to draw RNA fragment lengths. If 'normal', draw fragment lengths from a normal distribution. You can provide the mean of that normal distribution with fraglen (defaults to 250) and the standard deviation of that normal distribution with fragsd (defaults to 25). You can provide a single number for each, or a vector with length equal to the total number of samples. If 'empirical', draw fragment lengths from a fragment length distribution estimated from a real data set. If 'custom', draw fragment lengths from a custom distribution, which you can provide as the custdens argument. custdens should be a density fitted using logspline.
error_model: The error model can be one of:
- 'uniform': errors are distributed uniformly across reads. You can also provide an 'error_rate' parameter, giving the overall probability of making a sequencing error at any given nucleotide. This error rate defaults to 0.005.
- 'illumina4' or 'illumina5': Empirical error models. See ?add_platform_error for more information.
- 'custom': A custom error model you've estimated from an RNA-seq data set using GemErr. See ?add_platform_error for more info. You will need to provide both model_path and model_prefix if using a custom error model. model_path is the output folder you provided to build_error_model.py. This path should contain either two files suffixed _mate1 and _mate2, or a file suffixed _single. model_prefix is the 'prefix' argument you provided to build_error_model.py and is whatever comes before the _mate1/_mate2 or _single files in model_path.
bias One of 'none', 'rnaf', or 'cdnaf'. 'none' represents uniform fragment selection (every possible fragment in a transcript has equal probability of being in the experiment); 'rnaf' represents positional bias that arises in protocols using RNA fragmentation, and 'cdnaf' represents positional bias arising in protocols that use cDNA fragmentation (Li and Jiang 2012). Using the 'rnaf' model, coverage is higher in the middle of the transcript and lower at both ends, and in the 'cdnaf' model, coverage increases toward the 3' end of the transcript. The probability models used come from Supplementary Figure S3 of Li and Jiang (2012). Defaults to 'none' if you don't provide this.
gcbias list indicating which samples to add GC bias to, and from which models. Should be the same length as sum(num_reps); entries can be either numeric or of class loess. A numeric entry of 0 indicates no GC bias. Numeric entries 1 through 7 correspond to the 7 empirical GC models that ship with Polyester, estimated from GEUVADIS HapMap samples NA06985, NA12144, NA12776, NA18858, NA20542, NA20772, and NA20815, respectively. The code used to derive the empirical GC models is available at https://github.com/alyssafrazee/polyester/blob/master/make_gc_bias.R. A loess entry should be a loess prediction model that takes a GC content percent value (between 0 and 1) a transcript's deviation from overall mean read count based on that GC value. Counts for each replicate will be adjusted based on the GC bias model specified for it. Numeric and loess entries can be mixed. By default, no bias is included.
frag_GC_bias Either a matrix of dimensions 101 x sum(num_reps) or 'none'. The default is 'none'. If specified, the matrix contains the probabilities (a number in the range [0,1]) that a fragment will appear in the output given its GC content. The first row corresponds to a fragment with GC content of 0 percent, the second row 1 percent, the third row 2 percent, etc., and the last row 100 percent. The columns correspond to different probabilites for each sample. Internally, a coin flip (a Bernoulli trial) determines if each fragment is kept, depending on its GC content. Note that the final library size will depend on the elements of the matrix, and it might make sense to scale up the lib_size of the samples with low probabilites in the matrix in the range of the transcriptome GC content distribution. Note that the count_matrix written to outdir contains the counts before applying fragment GC bias.
strand_specific defaults to FALSE, which means fragments are generated with equal probability from both strands of the transcript sequence. set to TRUE for strand-specific simulation (1st read forward strand, 2nd read reverse strand with respect to transcript sequence).
meanmodel: set to TRUE if you'd like to set reads_per_transcripts as a function of transcript length. We fit a linear model regressing transcript abundance on transcript length, and setting meanmodel=TRUE means we will use transcript lengths to draw transcript abundance based on that linear model. You can see our modeling code at http://htmlpreview.github.io/?https://github.com/alyssafrazee/polyester_code/blob/master/length_simulation.html
write_info: set to FALSE if you do not want files of simulation information written to disk. By default, transcript fold changes and expression status, replicate library sizes and group identifiers, and an R data object of the counts matrix (before application of fragment GC bias) are written to outdir.
seed: specify a seed (e.g. seed=142 or some other integer) to set before randomly drawing read numbers, for reproducibility.
transcriptid: optional vector of transcript IDs to be written into sim_info.txt and used as transcript identifiers in the output fasta files. Defaults to names(readDNAStringSet(fasta)). This option is useful if default names are very long or contain special characters.
gzip: pass gzip=TRUE to write gzipped fasta files as output (by default, fasta output files are not compressed when written to disk).
exononly: (passed to seq_gtf) if TRUE (as it is by default), only create transcript sequences from the features labeled exon in gtf.
idfield: (passed to seq_gtf)in the attributes column of gtf, what is the name of the field identifying transcripts? Should be character. Default "transcript_id".
attrsep: (passed to seq_gtf) in the attributes column of gtf, how are attributes separated? Default "; ".

No return, but simulated reads and a simulation info file are written to outdir. Note that reads are written out transcript by transcript and so need to be shuffled if used as input to quantification algorithms such as eXpress or Salmon.

't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

McElroy KE, Luciani F and Thomas T (2012): GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

  ## simulate a few reads from chromosome 22

  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  set.seed(4)
  fold_change_values = sample(c(0.5, 1, 2), size=2*numtx,
     prob=c(0.05, 0.9, 0.05), replace=TRUE)
  fold_changes = matrix(fold_change_values, nrow=numtx)
  library(Biostrings)
  # remove quotes from transcript IDs:
  tNames = gsub("'", "", names(readDNAStringSet(fastapath)))

  simulate_experiment(fastapath, reads_per_transcript=10,
     fold_changes=fold_changes, outdir='simulated_reads',
     transcriptid=tNames, seed=12)