simulate_experiment: simulate RNA-seq experiment using negative binomial model

Description Usage Arguments Details Value References Examples

View source: R/simulate_experiment.R

Description

create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups

Usage

1
2
3
simulate_experiment(fasta = NULL, gtf = NULL, seqpath = NULL,
  outdir = ".", num_reps = c(10, 10), reads_per_transcript = 300,
  size = NULL, fold_changes, paired = TRUE, reportCoverage = FALSE, ...)

Arguments

fasta

path to FASTA file containing transcripts from which to simulate reads. See details.

gtf

path to GTF file containing transcript structures from which reads should be simulated. See details.

seqpath

path to folder containing one FASTA file (.fa extension) for each chromosome in gtf. See details.

outdir

character, path to folder where simulated reads should be written, with *no* slash at the end. By default, reads are written to current working directory.

num_reps

How many biological replicates should be in each group? The length num_reps determines how many groups are in the experiment. For example, num_reps = c(5,6,5) specifies a 3-group experiment with 5 samples in group 1, 6 samples in group 2, and 5 samples in group 3. Defaults to a 2-group experiment with 10 reps per group (i.e., c(10,10)).

reads_per_transcript

baseline mean number of reads to simulate from each transcript. Can be an integer, in which case this many reads are simulated from each transcript, or an integer vector whose length matches the number of transcripts in fasta. Default 300. You can also leave reads_per_transcript empty and set meanmodel=TRUE to draw baseline mean numbers from a model based on transcript length.

size

the negative binomial size parameter (see NegBinomial) for the number of reads drawn per transcript. It can be a matrix (where the user can specify the size parameter per transcript, per group), a vector (where the user can specify the size per transcript, perhaps relating to reads_per_transcript), or a single number, specifying the size for all transcripts and groups. If left NULL, defaults to reads_per_transcript * fold_changes / 3. Negative binomial variance is mean + mean^2 / size.

fold_changes

Matrix specifying multiplicative fold changes between groups. There is no default, so you must provide this argument. In real data sets, lowly-expressed transcripts often show high fold changes between groups, so this can be kept in mind when setting fold_changes and reads_per_transcript. This argument must have the same number of columns as there are groups as specified by num_reps, and must have the same number of rows as there are transcripts in fasta. A fold change of X in matrix entry i,j means that for replicate j, the baseline mean number of reads (reads_per_transcript[i]) will be multiplied by X. Note that the multiplication happens before the negative binomial value (for the number of reads that *actually will* be drawn from transcript i, for replicate j) is drawn. This argument is ignored if length(num_reps) is 1 (meaning you only have 1 group in your simulation).

paired

If TRUE, paired-end reads are simulated; else single-end reads are simulated. Default TRUE

reportCoverage

whether to write out coverage information to sample_coverages.rda file in the outdir. defaults to FALSE

...

any of several other arguments that can be used to add nuance to the simulation. See details.

Details

Reads can either be simulated from a FASTA file of transcripts (provided with the fasta argument) or from a GTF file plus DNA sequences (provided with the gtf and seqpath arguments). Simulating from a GTF file and DNA sequences may be a bit slower: it took about 6 minutes to parse the GTF/sequence files for chromosomes 1-22, X, and Y in hg19.

Several optional parameters can be passed to this function to adjust the simulation. The options are:

Value

No return, but simulated reads and a simulation info file are written to outdir. Note that reads are written out transcript by transcript and so need to be shuffled if used as input to quantification algorithms such as eXpress or Salmon.

References

't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

McElroy KE, Luciani F and Thomas T (2012): GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
  ## simulate a few reads from chromosome 22

  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  set.seed(4)
  fold_change_values = sample(c(0.5, 1, 2), size=2*numtx,
     prob=c(0.05, 0.9, 0.05), replace=TRUE)
  fold_changes = matrix(fold_change_values, nrow=numtx)
  library(Biostrings)
  # remove quotes from transcript IDs:
  tNames = gsub("'", "", names(readDNAStringSet(fastapath)))

  simulate_experiment(fastapath, reads_per_transcript=10,
     fold_changes=fold_changes, outdir='simulated_reads',
     transcriptid=tNames, seed=12)

polyester documentation built on Nov. 8, 2020, 8:09 p.m.