View source: R/generator_random.R
generator_random | R Documentation |
The generators generator_fasta_lm, generator_fasta_label_header_csv and generator_fasta_label_folder randomly choose a consecutive sequence of samples when a max_samples argument is supplied. generator_random instead chooses samples at random.
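The difference between the two sampling strategies can be sketched in base R. This is an illustration only, not deepG's implementation: `n_positions` and the helpers `sample_consecutive` and `sample_random` are hypothetical names introduced here.

```r
# Illustration only: contrast consecutive and fully random sample selection.
# All names below are hypothetical and not part of deepG.
set.seed(123)

n_positions <- 20   # number of possible sample start positions in one file
max_samples <- 5    # number of samples to draw from that file

# Consecutive strategy: pick a random start, then take a contiguous run.
sample_consecutive <- function(n_positions, max_samples) {
  start <- sample(n_positions - max_samples + 1, 1)
  seq(start, length.out = max_samples)
}

# Random strategy: draw distinct positions from anywhere in the file.
sample_random <- function(n_positions, max_samples) {
  sort(sample(n_positions, max_samples))
}

consecutive_idx <- sample_consecutive(n_positions, max_samples)
random_idx <- sample_random(n_positions, max_samples)
```

With the consecutive strategy, neighbouring samples overlap heavily, whereas the random strategy spreads samples across the whole file.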
generator_random(
train_type = "label_folder",
output_format = NULL,
seed = 123,
format = "fasta",
reverse_complement = TRUE,
path = NULL,
batch_size = c(100),
maxlen = 4,
ambiguous_nuc = "equal",
padding = FALSE,
vocabulary = c("a", "c", "g", "t"),
number_target_nt = 1,
n_gram = NULL,
n_gram_stride = NULL,
sample_by_file_size = TRUE,
max_samples = 1,
skip_amb_nuc = NULL,
vocabulary_label = NULL,
target_from_csv = NULL,
target_split = NULL,
max_iter = 1000,
verbose = TRUE,
set_learning = NULL,
shuffle_input = TRUE,
reverse_complement_encoding = FALSE,
proportion_entries = NULL,
masked_lm = NULL,
concat_seq = NULL,
return_int = FALSE,
reshape_xy = NULL
)
train_type | Either
output_format | Determines shape of output tensor for language model. Either
seed | Sets seed for
format | File format, either
reverse_complement | Boolean, for every new file decide randomly to use original data or its reverse complement.
path | Path to training data. If
batch_size | Number of samples in one batch.
maxlen | Length of predictor sequence.
ambiguous_nuc | How to handle nucleotides outside vocabulary, either
padding | Whether to pad sequences too short for one sample with zeros.
vocabulary | Vector of allowed characters. Characters outside vocabulary get encoded as specified in
number_target_nt | Number of target nucleotides for language model.
n_gram | Integer, encode target not nucleotide wise but combine n nucleotides at once. For example for
n_gram_stride | Step size for n-gram encoding. For AACCGGTT with
sample_by_file_size | Sample new file weighted by file size (bigger files more likely).
max_samples | Maximum number of samples to use from one file. If not
skip_amb_nuc | Threshold of ambiguous nucleotides to accept in fasta entry. Complete entry will get discarded otherwise.
vocabulary_label | Character vector of possible targets. Targets outside
target_from_csv | Path to csv file with target mapping. One column should be called "file" and other entries in row are the targets.
target_split | If target gets read from csv file, list of names to divide target tensor into list of tensors. Example: if csv file has header names
max_iter | Stop after
verbose | Whether to show messages.
set_learning | When you want to assign one label to set of samples. Only implemented for
shuffle_input | Whether to shuffle entries in every fasta/fastq file before extracting samples.
reverse_complement_encoding | Whether to use both original sequence and reverse complement as two input sequences.
proportion_entries | Proportion of fasta entries to keep. For example, if fasta file has 50 entries and
masked_lm | If not
concat_seq | Character string or
return_int | Whether to return integer encoding or one-hot encoding.
reshape_xy | Can be a list of functions to apply to input and/or target. List elements (containing the reshape functions) must be called x for input or y for target and each have arguments called x and y. For example:
A generator function.
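In R, a generator of this kind is typically a closure: a function that keeps its state and returns a new batch on each call. The sketch below shows only that pattern in base R. The helper `make_toy_generator`, its one-hot encoding, and the (batch_size, maxlen, vocabulary size) layout are assumptions made for illustration, not deepG's internals.

```r
# Hypothetical toy generator; not deepG code.
make_toy_generator <- function(sequence, batch_size, maxlen,
                               vocabulary = c("a", "c", "g", "t"),
                               seed = 123) {
  set.seed(seed)
  chars <- strsplit(tolower(sequence), "")[[1]]
  n_starts <- length(chars) - maxlen + 1   # number of valid window starts
  stopifnot(n_starts >= 1)

  # The returned function is the generator: each call yields one batch.
  function() {
    starts <- sample(n_starts, batch_size, replace = TRUE)
    x <- array(0, dim = c(batch_size, maxlen, length(vocabulary)))
    for (i in seq_len(batch_size)) {
      window <- chars[starts[i]:(starts[i] + maxlen - 1)]
      idx <- match(window, vocabulary)
      for (j in seq_len(maxlen)) {
        # Characters outside the vocabulary stay all-zero in this sketch.
        if (!is.na(idx[j])) x[i, j, idx[j]] <- 1
      }
    }
    list(x = x, starts = starts)
  }
}

toy_gen <- make_toy_generator("acgtacgtgg", batch_size = 2, maxlen = 4)
batch <- toy_gen()
dim(batch$x)  # 2 4 4
```

Calling `toy_gen()` again returns another randomly drawn batch, which is the behaviour a training loop relies on.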
library(deepG)
# temporary directory for the dummy input files
path_input <- tempfile()
dir.create(path_input)
# create 2 fasta files called 'file_1.fasta', 'file_2.fasta'
create_dummy_data(file_path = path_input,
                  num_files = 2,
                  seq_length = 5,
                  num_seq = 1,
                  vocabulary = c("a", "c", "g", "t"))
# dummy labels: one row per file, remaining columns are the targets
dummy_labels <- data.frame(file = c('file_1.fasta', 'file_2.fasta'),
                           label1 = c(0, 1),
                           label2 = c(1, 0))
# write the file-to-target mapping to a temporary csv file
target_from_csv <- tempfile(fileext = '.csv')
write.csv(dummy_labels, target_from_csv, row.names = FALSE)
# generator drawing random samples, with targets read from the csv file
gen <- generator_random(path = path_input, batch_size = 2,
                        vocabulary_label = c('label_a', 'label_b'),
                        train_type = 'label_csv',
                        maxlen = 5, target_from_csv = target_from_csv)
# draw one batch: first element is the input, second the target
z <- gen()
dim(z[[1]])
z[[2]]
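The reshape_xy argument is described as a list with elements named x and/or y, each a function with arguments called x and y. A minimal sketch of such a list follows. The particular transformations, the stand-in batch shapes, and the way the functions are applied here are assumptions for illustration; how deepG calls them internally is not shown in this page.

```r
# Hypothetical reshape functions; each takes arguments named x and y.
reshape_xy <- list(
  # Flatten each input sample into a single vector per batch entry.
  x = function(x, y) {
    array(x, dim = c(dim(x)[1], prod(dim(x)[-1])))
  },
  # Keep only the first target column.
  y = function(x, y) {
    y[, 1, drop = FALSE]
  }
)

# Stand-in batch (assumed shapes: 2 samples, maxlen 5, 4 characters;
# 2 target columns).
x <- array(0, dim = c(2, 5, 4))
y <- matrix(c(0, 1, 1, 0), nrow = 2)

x_new <- reshape_xy$x(x = x, y = y)
y_new <- reshape_xy$y(x = x, y = y)
dim(x_new)  # 2 20
dim(y_new)  # 2 1
```

Because both functions receive x and y, a reshape of the target can depend on the input and vice versa.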