illumina | R Documentation |
From either a reference genome or set of variant haplotypes, create Illumina reads
from error profiles and write them to FASTQ output file(s).
I encourage you to cite the reference below in addition to jackalope
if you use
this function.
illumina(obj,
out_prefix,
n_reads,
read_length,
paired,
frag_mean = 400,
frag_sd = 100,
matepair = FALSE,
seq_sys = NULL,
profile1 = NULL,
profile2 = NULL,
ins_prob1 = 0.00009,
del_prob1 = 0.00011,
ins_prob2 = 0.00015,
del_prob2 = 0.00023,
frag_len_min = NULL,
frag_len_max = NULL,
haplotype_probs = NULL,
barcodes = NULL,
prob_dup = 0.02,
sep_files = FALSE,
compress = FALSE,
comp_method = "bgzip",
n_threads = 1L,
read_pool_size = 1000L,
show_progress = FALSE,
overwrite = FALSE)
obj |
Sequencing object of class |
out_prefix |
Prefix for the output file(s), including entire path except for the file extension. |
n_reads |
Number of reads you want to create. |
read_length |
Length of reads. |
paired |
Logical for whether to use paired-end reads.
This argument is changed to |
frag_mean |
Mean of the Gamma distribution that generates fragment sizes.
Defaults to |
frag_sd |
Standard deviation of the Gamma distribution that generates
fragment sizes.
Defaults to |
matepair |
Logical for whether to simulate mate-pair reads.
Defaults to |
seq_sys |
Full or abbreviated name of sequencing system to use.
See "Sequencing systems" section for options.
See "Sequencing profiles" section for more information on how this argument,
|
profile1 |
Custom profile file for read 1.
See "Sequencing profiles" section for more information on how this argument,
|
profile2 |
Custom profile file for read 2.
See "Sequencing profiles" section for more information on how this argument,
|
ins_prob1 |
Insertion probability for read 1. Defaults to |
del_prob1 |
Deletion probability for read 1. Defaults to |
ins_prob2 |
Insertion probability for read 2. Defaults to |
del_prob2 |
Deletion probability for read 2. Defaults to |
frag_len_min |
Minimum fragment size. A |
frag_len_max |
Maximum fragment size.
A |
haplotype_probs |
Relative probability of sampling each haplotype.
This is ignored if sequencing a reference genome.
|
barcodes |
Character vector of barcodes for each haplotype, or a single barcode
if sequencing a reference genome. |
prob_dup |
A single number indicating the probability of duplicates.
Defaults to |
sep_files |
Logical indicating whether to make separate files for each haplotype.
This argument is coerced to |
compress |
Logical specifying whether or not to compress output file, or
an integer specifying the level of compression, from 1 to 9.
If |
comp_method |
Character specifying which type of compression to use if any
is desired. Options include |
n_threads |
The number of threads to use in processing.
If |
read_pool_size |
The number of reads to store before writing to disk.
Increasing this number should improve speed but take up more memory.
Defaults to |
show_progress |
Logical for whether to show a progress bar.
Defaults to |
overwrite |
Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist. |
Nothing is returned.
This section outlines how to use the seq_sys
, profile1
,
and profile2
arguments.
If all arguments are NULL
(their defaults), a sequencing system is chosen
based on the read length.
If, however, one or more arguments has been provided, then how they're provided
should depend on whether you want single- or paired-end reads.
For single-end reads
profile2
should be NULL
.
Only seq_sys
or profile1
should be provided, not both.
For paired-end reads
If providing seq_sys
, don't provide either profile1
or profile2
.
If providing profile1
, you must also provide profile2
(they can be the
same if you want) and you cannot provide seq_sys
.
Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.
name | abbrev | max_len | paired |
Genome Analyzer I | GA1 | 44 | Yes |
Genome Analyzer II | GA2 | 75 | Yes |
HiSeq 1000 | HS10 | 100 | Yes |
HiSeq 2000 | HS20 | 100 | Yes |
HiSeq 2500 | HS25 | 150 | Yes |
HiSeqX v2.5 PCR free | HSXn | 150 | Yes |
HiSeqX v2.5 TruSeq | HSXt | 150 | Yes |
MiniSeq TruSeq | MinS | 50 | No |
MiSeq v1 | MSv1 | 250 | Yes |
MiSeq v3 | MSv3 | 250 | Yes |
NextSeq 500 v2 | NS50 | 75 | Yes |
The ID lines for FASTQ files are formatted as such:
@<genome name>-<chromosome name>-<starting position>-<strand>[/<read#>]
where the part in []
is only for paired-end Illumina reads, and where genome name
is always REF
for reference genomes (as opposed to haplotypes).
Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
illumina(rg, paste0(dir, "/illumina_reads"), n_reads = 100,
read_length = 100, paired = FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.