simulate_experiment: simulate RNA-seq experiment using negative binomial model
In alyssafrazee/polyester-release: Simulate RNA-seq reads

Description Usage Arguments Details Value Examples

create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups

simulate_experiment(fasta = NULL, gtf = NULL, seqpath = NULL,
  num_reps = 10, fraglen = 250, fragsd = 25, readlen = 100,
  error_rate = 0.005, paired = TRUE, reads_per_transcript = 300,
  fold_changes, size = NULL, outdir = ".", write_info = TRUE,
  transcriptid = NULL, seed = NULL, ...)

`fasta`	path to FASTA file containing transcripts from which to simulate reads. See details.
`gtf`	path to GTF file containing transcript structures from which reads should be simulated. See details.
`seqpath`	path to folder containing one FASTA file (`.fa` extension) for each chromosome in `gtf`. See details.
`num_reps`	How many biological replicates should be in each group? If `num_reps` is a single integer, `num_reps` replicates will be simulated in each group. Otherwise, `num_reps` can be a length-2 vector, where `num_reps[1]` and `num_reps[2]` replicates will be simulated in each of the two groups.
`fraglen`	Mean RNA fragment length. Sequences will be read off the end(s) of these fragments.
`fragsd`	Standard deviation of fragment lengths.
`readlen`	Read length.
`error_rate`	Sequencing error rate. Must be between 0 and 1. A uniform error model is assumed.
`paired`	If `TRUE`, paired-end reads are simulated; else single-end reads are simulated.
`reads_per_transcript`	baseline mean number of reads to simulate from each transcript. Can be an integer, in which case this many reads are simulated from each transcript, or an integer vector whose length matches the number of transcripts in `fasta`.
`fold_changes`	Vector of multiplicative fold changes between groups, one entry per transcript in `fasta`. A fold change > 1 means the transcript is overexpressed in the first `num_reps` (or `num_reps[1]`) samples. Fold change < 1 means transcript is overexpressed in the last `num_reps` (or `num_reps[2]`) samples. The change is in the mean number of reads generated from the transcript, between groups.
`size`	the negative binomial `size` parameter (see `NegBinomial`) for the number of reads drawn per transcript. If left blank, defaults to `reads_per_transcript / 3`. Negative binomial variance is mean + mean^2 / size. Can either be left at default, a vector of the same length as number of transcripts in `fasta`, if the two groups should have the same size parameters, or a list with 2 elements, each of which is a vector with length equal to the number of transcripts in `fasta`, which represent the size parameters for each transcript in groups 1 and 2, respectively.
`outdir`	character, path to folder where simulated reads should be written, with no slash at the end. By default, reads are written to current working directory.
`write_info`	If `TRUE`, write a file matching transcript IDs to differential expression status into the file `outdir/sim_info.txt`.
`transcriptid`	optional vector of transcript IDs to be written into `sim_info.txt` and used as transcript identifiers in the fasta files. Defaults to `names(readDNAStringSet(fasta))`. This option is useful if default names are very long or contain special characters.
`seed`	Optional seed to set before simulating reads, for reproducibility.
`...`	additional arguments to pass to `seq_gtf` if using `gtf` and `seqpath`

Reads can either be simulated from a FASTA file of transcripts (provided with the fasta argument) or from a GTF file plus DNA sequences (provided with the gtf and seqpath arguments). Simulating from a GTF file and DNA sequences may be a bit slower: it took about 6 minutes to parse the GTF/sequence files for chromosomes 1-22, X, and Y in hg19.

No return, but simulated reads and a simulation info file are written to outdir.

  ## simulate a few reads from chromosome 22

  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  set.seed(4)
  fold_changes = sample(c(0.5, 1, 2), size=numtx,
     prob=c(0.05, 0.9, 0.05), replace=TRUE)
  library(Biostrings)
  # remove quotes from transcript IDs:
  tNames = gsub("'", "", names(readDNAStringSet(fastapath)))

  simulate_experiment(fastapath, reads_per_transcript=10,
     fold_changes=fold_changes, outdir='simulated_reads',
     transcriptid=tNames, seed=12)