View source: R/generator_utils.R
dataset_from_gen | R Documentation |
Repeatedly generates samples with a data generator and stores the output. Creates a separate rds or pickle file in output_path for each batch.
dataset_from_gen(
  output_path,
  iterations = 10,
  train_type = "lm",
  output_format = "target_right",
  path_corpus,
  batch_size = 32,
  maxlen = 250,
  step = NULL,
  vocabulary = c("a", "c", "g", "t"),
  shuffle = FALSE,
  set_learning = NULL,
  seed = NULL,
  random_sampling = FALSE,
  store_format = "rds",
  file_name_start = "batch_",
  masked_lm = NULL,
  ...
)
output_path |
Output directory. Output files will be named
iterations |
Number of batches (output files) to create.
train_type |
Either
output_format |
Determines shape of output tensor for language model. Either
path_corpus |
Input directory where fasta files are located, or path to a single file ending with fasta or fastq (as specified in the format argument). Can also be a list of directories and/or files.
batch_size |
Number of samples in one batch.
maxlen |
Length of predictor sequence.
step |
How often to take a sample.
vocabulary |
Vector of allowed characters. Characters outside vocabulary get encoded as specified in
shuffle |
Whether to shuffle samples within each batch.
set_learning |
When you want to assign one label to a set of samples. Only implemented for
seed |
Sets seed for
random_sampling |
Whether samples should be taken from random positions when using
store_format |
Either "rds" or "pickle".
file_name_start |
Start of output file names.
masked_lm |
If not
... |
Further generator options. See
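The interplay of maxlen and step can be illustrated with a small base R sketch. It assumes that step is the distance between the start positions of consecutive samples and that each sample is a window of maxlen characters; the helper window_starts is hypothetical and not part of the documented API. The real generator may count samples differently (for example, a language-model target may need one extra position), and the behaviour for step = NULL is not covered here.

```r
# Hypothetical helper, not part of the documented API: list the start
# positions of windows of length `maxlen` taken every `step` positions.
window_starts <- function(seq_length, maxlen, step) {
  last_start <- seq_length - maxlen + 1
  if (last_start < 1) {
    return(integer(0))  # sequence is shorter than one window
  }
  seq(from = 1L, to = last_start, by = step)
}

sequence <- "acgtacgt"  # toy sequence of length 8
maxlen <- 5
starts <- window_starts(nchar(sequence), maxlen = maxlen, step = 1)

# Cut one window of `maxlen` characters at each start position.
samples <- substring(sequence, starts, starts + maxlen - 1)
print(samples)
```

Under these assumptions, a larger step yields fewer, less overlapping windows from the same sequence.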
None. Function writes data to files and does not return a value.
# create dummy fasta files
temp_dir <- tempfile()
dir.create(temp_dir)
create_dummy_data(file_path = temp_dir,
                  num_files = 3,
                  seq_length = 8,
                  num_seq = 2)

# extract samples
out_dir <- tempfile()
dir.create(out_dir)
dataset_from_gen(output_path = out_dir,
                 iterations = 10,
                 train_type = "lm",
                 output_format = "target_right",
                 path_corpus = temp_dir,
                 batch_size = 32,
                 maxlen = 5,
                 step = 1,
                 file_name_start = "batch_")
list.files(out_dir)
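Because the function returns nothing and only writes files, a natural follow-up is to load a stored batch back into R. The sketch below uses only base R and a stand-in directory so that it runs on its own. The file name batch_1.rds and the placeholder list are assumptions made for illustration; the actual file names and the structure of a stored batch are not specified here, so the object is inspected with str() rather than indexed by name.

```r
# Stand-in for a directory produced by dataset_from_gen(); the file name and
# the list below are placeholders, not the real layout of a stored batch.
demo_dir <- tempfile()
dir.create(demo_dir)
saveRDS(list(placeholder = TRUE), file.path(demo_dir, "batch_1.rds"))

# Find files whose names begin with the file_name_start prefix ("batch_").
batch_files <- list.files(demo_dir, pattern = "^batch_", full.names = TRUE)

# Load the first file and inspect its structure without assuming its layout.
first_batch <- readRDS(batch_files[1])
str(first_batch)
```

With real output, demo_dir would be replaced by the output_path passed to dataset_from_gen. Files written with store_format = "pickle" are presumably meant to be read from Python rather than with readRDS.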