get_generator | R Documentation |
For a detailed description see the data generator tutorial.
Chooses one of the generators generator_fasta_lm, generator_fasta_label_folder, generator_fasta_label_header_csv, generator_rds, generator_random or generator_dummy according to the train_type and random_sampling arguments.
get_generator(
  path = NULL,
  train_type,
  batch_size,
  maxlen,
  step = NULL,
  shuffle_file_order = FALSE,
  vocabulary = c("A", "C", "G", "T"),
  seed = 1,
  proportion_entries = NULL,
  shuffle_input = FALSE,
  format = "fasta",
  path_file_log = NULL,
  reverse_complement = FALSE,
  n_gram = NULL,
  n_gram_stride = NULL,
  output_format = "target_right",
  ambiguous_nuc = "zero",
  proportion_per_seq = NULL,
  skip_amb_nuc = NULL,
  use_quality_score = FALSE,
  padding = FALSE,
  added_label_path = NULL,
  target_from_csv = NULL,
  add_input_as_seq = NULL,
  max_samples = NULL,
  concat_seq = NULL,
  target_len = 1,
  file_filter = NULL,
  use_coverage = NULL,
  sample_by_file_size = FALSE,
  add_noise = NULL,
  random_sampling = FALSE,
  set_learning = NULL,
  file_limit = NULL,
  reverse_complement_encoding = FALSE,
  read_data = FALSE,
  target_split = NULL,
  path_file_logVal = NULL,
  model = NULL,
  vocabulary_label = NULL,
  masked_lm = NULL,
  val = FALSE,
  return_int = FALSE,
  verbose = TRUE,
  delete_used_files = FALSE,
  reshape_xy = NULL
)
path | Path to training data. If
train_type | Either
batch_size | Number of samples used for one network update.
maxlen | Length of predictor sequence.
step | Frequency of sampling steps.
shuffle_file_order | Boolean, whether to go through files sequentially or shuffle beforehand.
vocabulary | Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.
seed | Sets seed for reproducible results.
proportion_entries | Proportion of fasta entries to keep. For example, if a fasta file has 50 entries and
shuffle_input | Whether to shuffle entries in file.
format | File format,
path_file_log | Write names of files used for training to a csv file if a path is specified.
reverse_complement | Boolean, for every new file decide randomly to use original data or its reverse complement.
n_gram | Integer, encode target not nucleotide-wise but combine n nucleotides at once. For example for
n_gram_stride | Step size for n-gram encoding. For AACCGGTT with
output_format | Determines shape of output tensor for language model. Either
ambiguous_nuc | How to handle nucleotides outside vocabulary, either
proportion_per_seq | Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence).
skip_amb_nuc | Threshold of ambiguous nucleotides to accept in fasta entry. Complete entry will get discarded otherwise.
use_quality_score | Whether to use fastq quality scores. If
padding | Whether to pad sequences too short for one sample with zeros.
added_label_path | Path to file with additional input labels. Should be a csv file with one column named "file". Other columns should correspond to labels.
target_from_csv | Path to csv file with target mapping. One column should be called "file" and other entries in row are the targets.
add_input_as_seq | Boolean vector specifying for each entry in
max_samples | Maximum number of samples to use from one file. If not
concat_seq | Character string or
target_len | Number of nucleotides to predict at once for language model.
file_filter | Vector of file names to use from path_corpus.
use_coverage | Integer or
sample_by_file_size | Sample new file weighted by file size (bigger files more likely).
add_noise |
random_sampling | Whether samples should be taken from random positions when using
set_learning | When you want to assign one label to a set of samples. Only implemented for
file_limit | Integer or
reverse_complement_encoding | Whether to use both original sequence and reverse complement as two input sequences.
read_data | If
target_split | If target gets read from csv file, list of names to divide target tensor into list of tensors. Example: if csv file has header names
path_file_logVal | Path to csv file logging used validation files.
model | A keras model.
vocabulary_label | Character vector of possible targets. Targets outside
masked_lm | If not
val | Logical, call initialized generator "genY" or "genValY" where Y is an integer between 1 and length of directories.
return_int | Whether to return integer encoding or one-hot encoding.
verbose | Whether to show messages.
delete_used_files | Whether to delete file once used. Only applies to rds files.
reshape_xy | Can be a list of functions to apply to input and/or target. List elements (containing the reshape functions) must be called x for input or y for target and each have arguments called x and y. For example:
A generator function.
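The vocabulary and ambiguous_nuc arguments control how characters are encoded. As a rough illustration in base R (it does not call the package), the sketch below one-hot encodes a sequence and leaves characters outside the vocabulary as all-zero rows, which is one reading of the default ambiguous_nuc = "zero". The helper one_hot_zero is hypothetical, and the package's actual encoding may differ in detail.

```r
# Hypothetical helper, not part of the package: one-hot encode a sequence,
# leaving characters outside the vocabulary as all-zero rows.
one_hot_zero <- function(sequence, vocabulary = c("A", "C", "G", "T")) {
  chars <- strsplit(sequence, "")[[1]]
  encoded <- matrix(0L, nrow = length(chars), ncol = length(vocabulary),
                    dimnames = list(chars, vocabulary))
  idx <- match(chars, vocabulary)  # NA for characters outside the vocabulary
  known <- which(!is.na(idx))
  encoded[cbind(known, idx[known])] <- 1L
  encoded
}

enc <- one_hot_zero("ACNT")
print(enc)
```

In this sketch the row for "N" stays all zeros, while every other row has exactly one entry set to 1.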
# create dummy fasta files
fasta_path <- tempfile()
dir.create(fasta_path)
create_dummy_data(file_path = fasta_path,
                  num_files = 3,
                  seq_length = 10,
                  num_seq = 5,
                  vocabulary = c("a", "c", "g", "t"))

# create a language model generator over the dummy files
gen <- get_generator(path = fasta_path,
                     maxlen = 5, train_type = "lm",
                     output_format = "target_right",
                     step = 3, batch_size = 7)

# each call to the generator returns one batch: input first, target second
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
dim(y)
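The example uses maxlen = 5 and step = 3 on sequences of length 10. The following base R sketch illustrates how a sliding window of that kind could cut one sequence into input/target pairs. It does not call the package. It assumes that step is the distance between the start positions of consecutive samples and that, for output_format = "target_right", the target is the target_len characters directly after the input. The helper name sliding_samples is hypothetical.

```r
# Hypothetical helper, not part of the package: cut one sequence into
# (input, target) pairs with a sliding window.
sliding_samples <- function(sequence, maxlen, step, target_len = 1) {
  window <- maxlen + target_len              # characters needed per sample
  last_start <- nchar(sequence) - window + 1
  if (last_start < 1) {
    # sequence is too short for even one sample
    return(data.frame(start = integer(0), input = character(0),
                      target = character(0), stringsAsFactors = FALSE))
  }
  starts <- seq(1, last_start, by = step)
  data.frame(
    start = starts,
    input = substring(sequence, starts, starts + maxlen - 1),
    target = substring(sequence, starts + maxlen, starts + window - 1),
    stringsAsFactors = FALSE
  )
}

samples <- sliding_samples("ACGTACGTAC", maxlen = 5, step = 3)
print(samples)
```

Under these assumptions a sequence of length 10 yields two samples, and a smaller step yields more, overlapping samples from the same sequence.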