Description Usage Arguments Details Value References Examples
View source: R/simulate_experiment.R
create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups
1 2 3 |
fasta |
path to FASTA file containing transcripts from which to simulate reads. See details. |
gtf |
path to GTF file containing transcript structures from which reads should be simulated. See details. |
seqpath |
path to folder containing one FASTA file ( |
outdir |
character, path to folder where simulated reads should be written, with *no* slash at the end. By default, reads are written to current working directory. |
num_reps |
How many biological replicates should be in each group? The
length |
reads_per_transcript |
baseline mean number of reads to simulate
from each transcript. Can be an integer, in which case this many reads
are simulated from each transcript, or an integer vector whose length
matches the number of transcripts in |
size |
the negative binomial |
fold_changes |
Matrix specifying multiplicative fold changes
between groups. There is no default, so you must provide this argument.
In real data sets, lowly-expressed transcripts often show high fold
changes between groups, so this can be kept in mind when setting
|
paired |
If |
... |
any of several other arguments that can be used to add nuance to the simulation. See details. |
Reads can either be simulated from a FASTA file of transcripts
(provided with the fasta
argument) or from a GTF file plus DNA
sequences (provided with the gtf
and seqpath
arguments).
Simulating from a GTF file and DNA sequences may be a bit slower: it took
about 6 minutes to parse the GTF/sequence files for chromosomes 1-22, X,
and Y in hg19.
Several optional parameters can be passed to this function to adjust the simulation. The options are:
readlen
: read length. Default 100.
lib_sizes
: Library size factors for the biological replicates.
lib_sizes
should have length equal to the total number of
replicates in the experiment, i.e., sum(num_reps)
. For each
replicate, once the number of reads to simulate from each transcript for
that replicate is known, all read numbers across all transcripts from that
replicate are multiplied by the corresponding entry in lib_sizes
.
distr
One of 'normal', 'empirical', or 'custom', which
specifies the distribution from which to draw RNA fragment lengths. If
'normal', draw fragment lengths from a normal distribution. You can provide
the mean of that normal distribution with fraglen
(defaults to 250)
and the standard deviation of that normal distribution with fragsd
(defaults to 25). If 'empirical', draw fragment lengths
from a fragment length distribution estimated from a real data set. If
'custom', draw fragment lengths from a custom distribution, which you can
provide as the custdens
argument. custdens
should be a
density fitted using logspline
.
error_model
: The error model can be one of:
'uniform'
: errors are distributed uniformly across reads.
You can also provide an 'error_rate'
parameter, giving the overall
probability of making a sequencing error at any given nucleotide. This
error rate defaults to 0.005.
'illumina4'
or 'illumina5'
: Empirical error models.
See ?add_platform_error
for more information.
'custom'
: A custom error model you've estimated from an
RNA-seq data set using GemErr
. See ?add_platform_error
for more info. You will need to provide both model_path
and
model_prefix
if using a custom error model. model_path
is
the output folder you provided to build_error_model.py
. This path
should contain either two files suffixed _mate1 and _mate2, or a file
suffixed _single. model_prefix
is the 'prefix' argument you
provided to build_error_model.py
and is whatever comes before the
_mate1/_mate2 or _single files in model_path
.
bias
One of 'none', 'rnaf', or 'cdnaf'. 'none'
represents uniform fragment selection (every possible fragment in a
transcript has equal probability of being in the experiment); 'rnaf'
represents positional bias that arises in protocols using RNA
fragmentation, and 'cdnaf' represents positional bias arising in protocols
that use cDNA fragmentation (Li and Jiang 2012). Using the 'rnaf' model,
coverage is higher in the middle of the transcript and lower at both ends,
and in the 'cdnaf' model, coverage increases toward the 3' end of the
transcript. The probability models used come from Supplementary Figure S3
of Li and Jiang (2012). Defaults to 'none' if you don't provide this.
gcbias
list indicating which samples to add GC bias to, and
from which models. Should be the same length as sum(num_reps)
;
entries can be either numeric or of class loess
. A numeric entry of
0 indicates no GC bias. Numeric entries 1 through 7 correspond to the
7 empirical GC models that ship with Polyester, estimated from GEUVADIS
HapMap samples NA06985, NA12144, NA12776, NA18858, NA20542, NA20772,
and NA20815, respectively. The code used to derive the empirical GC models
is available at
https://github.com/alyssafrazee/polyester/blob/master/make_gc_bias.R.
A loess entry should be a loess prediction model
that takes a GC content percent value (between 0 and 1) a transcript's
deviation from overall mean read count based on that GC value. Counts for
each replicate will be adjusted based on the GC bias model specified for
it. Numeric and loess entries can be mixed. By default, no bias is
included.
meanmodel
: set to TRUE if you'd like to set
reads_per_transcripts
as a function of transcript length. We
fit a linear model regressing transcript abundance on transcript length,
and setting meanmodel=TRUE
means we will use transcript lengths
to draw transcript abundance based on that linear model. You can see our
modeling code at http://htmlpreview.github.io/?https://github.com/alyssafrazee/polyester_code/blob/master/length_simulation.html
write_info
: set to FALSE if you do not want files of
simulation information written to disk. By default, transcript fold
changes and expression status & replicate library sizes and group
identifiers are written to outdir
.
seed
: specify a seed (e.g. seed=142
or some other
integer) to set before randomly drawing read numbers, for reproducibility.
transcriptid
: optional vector of transcript IDs to be written
into sim_info.txt
and used as transcript identifiers in the output
fasta files. Defaults to names(readDNAStringSet(fasta))
. This
option is useful if default names are very long or contain special
characters.
You can also include other parameters to pass to
seq_gtf
if you're simulating from a GTF file.
fragGCBias
: (new argument) either NULL or a function
The function should
map from GC content (in [0,1]) to the probability of observing a fragment,
based on the fragment's GC content. A coin is flipped inside
generate_fragments
and only fragments which pass the coin
flip are returned.
This should be used INSTEAD of gcbias
above.
fragGCBiasData
: (new argument) a list of data which will be
used within fragGCBias
. should be as long as the number of samples.
No return, but simulated reads and a simulation info file are written
to outdir
.
't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.
Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.
McElroy KE, Luciani F and Thomas T (2012): GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | ## simulate a few reads from chromosome 22
fastapath = system.file("extdata", "chr22.fa", package="polyester")
numtx = count_transcripts(fastapath)
set.seed(4)
fold_changes = sample(c(0.5, 1, 2), size=numtx,
prob=c(0.05, 0.9, 0.05), replace=TRUE)
library(Biostrings)
# remove quotes from transcript IDs:
tNames = gsub("'", "", names(readDNAStringSet(fastapath)))
simulate_experiment(fastapath, reads_per_transcript=10,
fold_changes=fold_changes, outdir='simulated_reads',
transcriptid=tNames, seed=12)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.