View source: R/generator_folder_collect.R
generator_initialize | R Documentation |
Initializes generators defined by generator_fasta_label_folder function
Initializes generators defined by the generator_fasta_label_folder function. Targets get encoded in order of directories. The number of classes is given by the length of directories.
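As a rough illustration of what "encoded in order of directories" can mean, the base R sketch below builds one-hot targets from directory positions. The folder names and the helper one_hot_by_position are hypothetical, and the one-hot layout is an assumption rather than something taken from the package source.

```r
# Hypothetical class folders; the position in the vector defines the class index.
directories <- c("data/class_a", "data/class_b", "data/class_c")
num_classes <- length(directories)

# Hypothetical helper: one row per sample, with a 1 in the column of its directory.
one_hot_by_position <- function(dir_index, num_classes) {
  y <- matrix(0, nrow = length(dir_index), ncol = num_classes)
  y[cbind(seq_along(dir_index), dir_index)] <- 1
  y
}

# Two samples from the first folder, one from the third.
y <- one_hot_by_position(c(1, 1, 3), num_classes)
print(y)
```

Reordering the directories vector would change which column stands for which class, so the order should stay fixed between training and evaluation.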
generator_initialize(
  directories,
  format = "fasta",
  batch_size = 256,
  maxlen = 250,
  max_iter = 10000,
  vocabulary = c("a", "c", "g", "t"),
  verbose = FALSE,
  shuffle_file_order = FALSE,
  step = 1,
  seed = 1234,
  shuffle_input = FALSE,
  file_limit = NULL,
  path_file_log = NULL,
  reverse_complement = FALSE,
  reverse_complement_encoding = FALSE,
  val = FALSE,
  ambiguous_nuc = "zero",
  proportion_per_seq = NULL,
  target_middle = FALSE,
  read_data = FALSE,
  use_quality_score = FALSE,
  padding = TRUE,
  added_label_path = NULL,
  add_input_as_seq = NULL,
  skip_amb_nuc = NULL,
  max_samples = NULL,
  file_filter = NULL,
  concat_seq = NULL,
  use_coverage = NULL,
  set_learning = NULL,
  proportion_entries = NULL,
  sample_by_file_size = FALSE,
  n_gram = NULL,
  n_gram_stride = 1,
  add_noise = NULL,
  return_int = FALSE,
  reshape_xy = NULL
)
directories |
Vector of paths to folders containing fasta files. Files in one folder should belong to one class. |
format |
File format, either |
batch_size |
Number of samples in one batch. |
maxlen |
Length of predictor sequence. |
max_iter |
Stop after |
vocabulary |
Vector of allowed characters. Characters outside vocabulary get encoded as specified in |
verbose |
Whether to show messages. |
shuffle_file_order |
Logical, whether to go through files randomly or sequentially. |
step |
How often to take a sample. |
seed |
Sets seed for |
shuffle_input |
Whether to shuffle entries in every fasta/fastq file before extracting samples. |
file_limit |
Integer or |
path_file_log |
Write name of files to csv file if path is specified. |
reverse_complement |
Boolean, for every new file decide randomly to use original data or its reverse complement. |
reverse_complement_encoding |
Whether to use both original sequence and reverse complement as two input sequences. |
val |
Logical, whether to name the initialized generators "genY" or "genValY", where Y is an integer between 1 and the length of directories. |
ambiguous_nuc |
How to handle nucleotides outside vocabulary, either
|
proportion_per_seq |
Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence). |
target_middle |
Split input sequence into two sequences while removing nucleotide in middle. If input is x_1,..., x_(n+1), input gets split into input_1 = x_1,..., x_m and input_2 = x_(n+1),..., x_(m+2) where m = ceiling((n+1)/2) and n = maxlen. Note that x_(m+1) is not used. |
read_data |
If TRUE, the first element of the output is a list of length 2, each element containing one part of a paired read. |
use_quality_score |
Whether to use fastq quality scores. If TRUE, the input is not one-hot encoded but corresponds to probabilities. For example (0.97, 0.01, 0.01, 0.01) instead of (1, 0, 0, 0). |
padding |
Whether to pad sequences too short for one sample with zeros. |
added_label_path |
Path to file with additional input labels. Should be a csv file with one column named "file". Other columns should correspond to labels. |
add_input_as_seq |
Boolean vector specifying for each entry in |
skip_amb_nuc |
Threshold of ambiguous nucleotides to accept in fasta entry. Complete entry will get discarded otherwise. |
max_samples |
Maximum number of samples to use from one file. If not |
file_filter |
Vector of file names to use from path_corpus. |
concat_seq |
Character string or |
use_coverage |
Integer or |
set_learning |
When you want to assign one label to set of samples. Only implemented for
|
proportion_entries |
Proportion of fasta entries to keep. For example, if fasta file has 50 entries and |
sample_by_file_size |
Sample new file weighted by file size (bigger files more likely). |
n_gram |
Integer, encode target not nucleotide wise but combine n nucleotides at once. For example for |
n_gram_stride |
Step size for n-gram encoding. For AACCGGTT with |
add_noise |
|
return_int |
Whether to return integer encoding or one-hot encoding. |
reshape_xy |
Can be a list of functions to apply to input and/or target. List elements (containing the reshape functions)
must be called x for input or y for target and each have arguments called x and y. For example:
|
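To make the naming convention for reshape_xy concrete, here is a minimal sketch of a list in the shape described above, applied by hand to dummy arrays. The two functions and the array shapes are hypothetical, and how the generator applies the list internally is not shown here.

```r
# Dummy batch: 4 samples, maxlen 5, vocabulary size 4; targets for 2 classes.
x <- array(0, dim = c(4, 5, 4))
y <- matrix(0, nrow = 4, ncol = 2)

# Hypothetical reshape functions: elements named x and y, each with arguments x and y.
reshape_xy <- list(
  # flatten every sample into one row
  x = function(x, y) array(x, dim = c(dim(x)[1], prod(dim(x)[-1]))),
  # leave the target unchanged
  y = function(x, y) y
)

# Apply both functions to the original x and y.
x_new <- reshape_xy$x(x = x, y = y)
y_new <- reshape_xy$y(x = x, y = y)
dim(x_new)
```

Giving both functions access to x and y makes it possible for a reshaped target to depend on the input, or the other way round.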
List of generator functions.
# create two folders with dummy fasta files
path_input_1 <- tempfile()
dir.create(path_input_1)
create_dummy_data(file_path = path_input_1, num_files = 2, seq_length = 5,
                  num_seq = 2, vocabulary = c("a", "c", "g", "t"))
path_input_2 <- tempfile()
dir.create(path_input_2)
create_dummy_data(file_path = path_input_2, num_files = 3, seq_length = 7,
                  num_seq = 5, vocabulary = c("a", "c", "g", "t"))

# initialize generators, one folder per class
gen_list <- generator_initialize(directories = c(path_input_1, path_input_2),
                                 batch_size = 4, maxlen = 5)

# draw one batch from the first generator
z1 <- gen_list[[1]]()
z1[[1]]  # first element: input
z1[[2]]  # second element: target
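The index arithmetic described for target_middle can be traced with plain vectors. This is a sketch that follows the formula in the argument description with n = maxlen = 5; the toy sequence and the variable names are made up for illustration, and no package function is called.

```r
maxlen <- 5                                # n in the argument description
x <- c("a", "c", "g", "t", "a", "c")       # x_1, ..., x_(n+1)
m <- ceiling((maxlen + 1) / 2)             # m = 3

input_1 <- x[1:m]                          # x_1, ..., x_m
input_2 <- x[(maxlen + 1):(m + 2)]         # x_(n+1), ..., x_(m+2), read backwards
skipped <- x[m + 1]                        # x_(m+1), used by neither input

input_1
input_2
skipped
```

Both inputs end next to the skipped position, one approaching it from the left and one from the right.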