| get_generator | R Documentation | 
For a detailed description see the data generator tutorial.
Chooses one of the generators generator_fasta_lm,
generator_fasta_label_folder, generator_fasta_label_header_csv,
generator_rds, generator_random or generator_dummy
according to the train_type and random_sampling
arguments.
get_generator(
  path = NULL,
  train_type,
  batch_size,
  maxlen,
  step = NULL,
  shuffle_file_order = FALSE,
  vocabulary = c("A", "C", "G", "T"),
  seed = 1,
  proportion_entries = NULL,
  shuffle_input = FALSE,
  format = "fasta",
  path_file_log = NULL,
  reverse_complement = FALSE,
  n_gram = NULL,
  n_gram_stride = NULL,
  output_format = "target_right",
  ambiguous_nuc = "zero",
  proportion_per_seq = NULL,
  skip_amb_nuc = NULL,
  use_quality_score = FALSE,
  padding = FALSE,
  added_label_path = NULL,
  target_from_csv = NULL,
  add_input_as_seq = NULL,
  max_samples = NULL,
  concat_seq = NULL,
  target_len = 1,
  file_filter = NULL,
  use_coverage = NULL,
  sample_by_file_size = FALSE,
  add_noise = NULL,
  random_sampling = FALSE,
  set_learning = NULL,
  file_limit = NULL,
  reverse_complement_encoding = FALSE,
  read_data = FALSE,
  target_split = NULL,
  path_file_logVal = NULL,
  model = NULL,
  vocabulary_label = NULL,
  masked_lm = NULL,
  val = FALSE,
  return_int = FALSE,
  verbose = TRUE,
  delete_used_files = FALSE,
  reshape_xy = NULL
)
| path | Path to training data. If  | |||||||
| train_type | Either  
 | |||||||
| batch_size | Number of samples used for one network update. | |||||||
| maxlen | Length of predictor sequence. | |||||||
| step | Step size between the start positions of consecutive samples. | |||||||
| shuffle_file_order | Boolean, whether to go through files sequentially or shuffle beforehand. | |||||||
| vocabulary | Vector of allowed characters. Characters outside vocabulary get encoded as specified in  | |||||||
| seed | Sets seed for reproducible results. | |||||||
| proportion_entries | Proportion of fasta entries to keep. For example, if fasta file has 50 entries and  | |||||||
| shuffle_input | Whether to shuffle entries in file. | |||||||
| format | File format,  | |||||||
| path_file_log | Write name of files used for training to csv file if path is specified. | |||||||
| reverse_complement | Boolean, for every new file decide randomly to use original data or its reverse complement. | |||||||
| n_gram | Integer, encode the target not nucleotide-wise but by combining n nucleotides at once. For example for  | |||||||
| n_gram_stride | Step size for n-gram encoding. For AACCGGTT with  | |||||||
| output_format | Determines shape of output tensor for language model.
Either  
 | |||||||
| ambiguous_nuc | How to handle nucleotides outside vocabulary, either  
 | |||||||
| proportion_per_seq | Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence). | |||||||
| skip_amb_nuc | Threshold of ambiguous nucleotides to accept in fasta entry. Complete entry will get discarded otherwise. | |||||||
| use_quality_score | Whether to use fastq quality scores. If  | |||||||
| padding | Whether to pad sequences too short for one sample with zeros. | |||||||
| added_label_path | Path to file with additional input labels. Should be a csv file with one column named "file". Other columns should correspond to labels. | |||||||
| target_from_csv | Path to csv file with target mapping. One column should be called "file" and other entries in row are the targets. | |||||||
| add_input_as_seq | Boolean vector specifying for each entry in  | |||||||
| max_samples | Maximum number of samples to use from one file. If not  | |||||||
| concat_seq | Character string or  | |||||||
| target_len | Number of nucleotides to predict at once for language model. | |||||||
| file_filter | Vector of file names to use from path. | |||||||
| use_coverage | Integer or  | |||||||
| sample_by_file_size | Sample new file weighted by file size (bigger files more likely). | |||||||
| add_noise | 
 | |||||||
| random_sampling | Whether samples should be taken from random positions when using  | |||||||
| set_learning | Use when one label should be assigned to a set of samples. Only implemented for  
 | |||||||
| file_limit | Integer or  | |||||||
| reverse_complement_encoding | Whether to use both original sequence and reverse complement as two input sequences. | |||||||
| read_data | If  | |||||||
| target_split | If target gets read from csv file, list of names to divide target tensor into list of tensors.
Example: if csv file has header names  | |||||||
| path_file_logVal | Path to csv file logging used validation files. | |||||||
| model | A keras model. | |||||||
| vocabulary_label | Character vector of possible targets. Targets outside  | |||||||
| masked_lm | If not  
 | |||||||
| val | Logical, call initialized generator "genY" or "genValY" where Y is an integer between 1 and length of directories. | |||||||
| return_int | Whether to return integer encoding or one-hot encoding. | |||||||
| verbose | Whether to show messages. | |||||||
| delete_used_files | Whether to delete file once used. Only applies for rds files. | |||||||
| reshape_xy | Can be a list of functions to apply to input and/or target. List elements (containing the reshape functions)
must be called x for input or y for target and each have arguments called x and y. For example:
 | 
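The structure described for reshape_xy (a list with elements named x and/or y, each a function with arguments x and y) can be sketched in base R as follows. The two functions and the stand-in arrays are hypothetical and chosen only for illustration; how the generator applies them internally is not shown here, so the manual calls at the end are an assumption.

```r
# Hypothetical reshape functions; element names and argument names
# follow the reshape_xy description (x for input, y for target).
reshape_xy <- list(
  # applied to the input: flatten each sample into one row
  x = function(x, y) {
    array(x, dim = c(dim(x)[1], prod(dim(x)[-1])))
  },
  # applied to the target: returned unchanged
  y = function(x, y) {
    y
  }
)

# Stand-in batch (assumed shapes): 7 samples, 5 positions, 4 letters
x <- array(0, dim = c(7, 5, 4))
y <- matrix(0, nrow = 7, ncol = 4)

# Apply the functions by hand to see their effect on the shapes
x_new <- reshape_xy$x(x = x, y = y)
y_new <- reshape_xy$y(x = x, y = y)
dim(x_new)
dim(y_new)
```

Both functions receive x and y, so a reshape of the target may depend on the input and vice versa.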
A generator function.
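As a rough illustration of what "a generator function" means here, the base R sketch below builds a closure that returns the next batch each time it is called without arguments. The helper make_toy_generator and its data are hypothetical and are not part of the package; the sketch only mirrors the calling pattern of a function that yields a list of input and target.

```r
# Hypothetical toy generator: cycles through the rows of a matrix
make_toy_generator <- function(data, batch_size) {
  pos <- 0  # number of rows handed out so far
  function() {
    # row indices of the next batch, wrapping around at the end
    idx <- (pos + seq_len(batch_size) - 1) %% nrow(data) + 1
    pos <<- pos + batch_size
    # first element plays the role of the input, second of the target
    list(data[idx, , drop = FALSE], idx)
  }
}

toy_data <- matrix(seq_len(30), nrow = 10, ncol = 3)
toy_gen <- make_toy_generator(toy_data, batch_size = 4)

b1 <- toy_gen()  # rows 1 to 4
b2 <- toy_gen()  # rows 5 to 8
b3 <- toy_gen()  # rows 9, 10, 1, 2 (wraps around)
dim(b1[[1]])
```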
# create dummy fasta files
fasta_path <- tempfile()
dir.create(fasta_path)
create_dummy_data(file_path = fasta_path,
                  num_files = 3,
                  seq_length = 10,
                  num_seq = 5,
                  vocabulary = c("a", "c", "g", "t"))
gen <- get_generator(path = fasta_path,
                     maxlen = 5, train_type = "lm",
                     output_format = "target_right",
                     step = 3, batch_size = 7)
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
dim(y)
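To make the roles of maxlen, step and target_len in the example more concrete, here is a minimal base R sketch of sliding-window sampling on a single sequence. It assumes that each sample consists of maxlen input characters followed by target_len target characters and that consecutive samples start step positions apart; the package's own sampling logic may differ in its details.

```r
# Toy sequence of length 10, matching seq_length in the example above
seq_chars <- strsplit("ACGTACGTAC", "")[[1]]
maxlen <- 5
step <- 3
target_len <- 1

# Each sample needs maxlen input positions plus target_len target positions
window <- maxlen + target_len
starts <- seq(1, length(seq_chars) - window + 1, by = step)

# Cut out input (x) and target (y) for every start position
samples <- lapply(starts, function(s) {
  list(
    x = paste(seq_chars[s:(s + maxlen - 1)], collapse = ""),
    y = paste(seq_chars[(s + maxlen):(s + window - 1)], collapse = "")
  )
})

starts
str(samples)
```

Under these assumptions a sequence of length 10 yields two samples, starting at positions 1 and 4; a third window starting at position 7 would run past the end of the sequence.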