View source: R/generator_folder.R
generator_fasta_label_folder | R Documentation |
Iterates over a folder containing fasta/fastq files and produces encodings of predictor sequences
and target variables. All files in path_corpus
should belong to one class.
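To make "encoding of predictor sequences" concrete, here is a minimal base R sketch of one-hot encoding over the default vocabulary. The helper one_hot_encode is hypothetical and not part of the package. Mapping characters outside the vocabulary to all-zero rows is an assumed reading of the default ambiguous_nuc = "zero".

```r
# Hypothetical helper (not part of the package): one-hot encode a sequence.
# Characters outside `vocabulary` become all-zero rows, which is an assumed
# reading of ambiguous_nuc = "zero".
one_hot_encode <- function(sequence, vocabulary = c("a", "c", "g", "t")) {
  chars <- strsplit(tolower(sequence), "")[[1]]
  encoding <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                     dimnames = list(NULL, vocabulary))
  idx <- match(chars, vocabulary)   # NA for characters outside the vocabulary
  known <- !is.na(idx)
  encoding[cbind(which(known), idx[known])] <- 1
  encoding
}

one_hot_encode("acgtn")
```

Each row of the result corresponds to one position of the sequence and each column to one vocabulary character.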
generator_fasta_label_folder(
  path_corpus,
  format = "fasta",
  batch_size = 256,
  maxlen = 250,
  max_iter = 10000,
  vocabulary = c("a", "c", "g", "t"),
  verbose = FALSE,
  shuffle_file_order = FALSE,
  step = 1,
  seed = 1234,
  shuffle_input = FALSE,
  file_limit = NULL,
  path_file_log = NULL,
  reverse_complement = TRUE,
  reverse_complement_encoding = FALSE,
  num_targets,
  ones_column,
  ambiguous_nuc = "zero",
  proportion_per_seq = NULL,
  read_data = FALSE,
  use_quality_score = FALSE,
  padding = TRUE,
  added_label_path = NULL,
  add_input_as_seq = NULL,
  skip_amb_nuc = NULL,
  max_samples = NULL,
  concat_seq = NULL,
  file_filter = NULL,
  use_coverage = NULL,
  proportion_entries = NULL,
  sample_by_file_size = FALSE,
  n_gram = NULL,
  n_gram_stride = 1,
  masked_lm = NULL,
  add_noise = NULL,
  return_int = FALSE,
  reshape_xy = NULL
)
path_corpus |
Input directory where fasta files are located, or path to a single file ending with fasta or fastq (as specified in the format argument). Can also be a list of directories and/or files. |
format |
File format, either |
batch_size |
Number of samples in one batch. |
maxlen |
Length of predictor sequence. |
max_iter |
Stop after |
vocabulary |
Vector of allowed characters. Characters outside vocabulary get encoded as specified in |
verbose |
Whether to show messages. |
shuffle_file_order |
Logical, whether to go through files randomly or sequentially. |
step |
How often to take a sample. |
seed |
Sets seed for |
shuffle_input |
Whether to shuffle entries in every fasta/fastq file before extracting samples. |
file_limit |
Integer or |
path_file_log |
Write name of files to csv file if path is specified. |
reverse_complement |
Boolean. For every new file, decide randomly whether to use the original data or its reverse complement. |
reverse_complement_encoding |
Whether to use both original sequence and reverse complement as two input sequences. |
num_targets |
Number of columns of target matrix. |
ones_column |
Which column of target matrix contains ones. |
ambiguous_nuc |
How to handle nucleotides outside vocabulary, either
|
proportion_per_seq |
Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence). |
read_data |
If |
use_quality_score |
Whether to use fastq quality scores. If TRUE, the input is not one-hot encoded but holds probabilities, for example (0.97, 0.01, 0.01, 0.01) instead of (1, 0, 0, 0). |
padding |
Whether to pad sequences too short for one sample with zeros. |
added_label_path |
Path to file with additional input labels. Should be a csv file with one column named "file". Other columns should correspond to labels. |
add_input_as_seq |
Boolean vector specifying for each entry in |
skip_amb_nuc |
Threshold of ambiguous nucleotides to accept in fasta entry. Complete entry will get discarded otherwise. |
max_samples |
Maximum number of samples to use from one file. If not |
concat_seq |
Character string or |
file_filter |
Vector of file names to use from path_corpus. |
use_coverage |
Integer or |
proportion_entries |
Proportion of fasta entries to keep. For example, if fasta file has 50 entries and |
sample_by_file_size |
Sample new file weighted by file size (bigger files more likely). |
n_gram |
Integer, encode target not nucleotide wise but combine n nucleotides at once. For example for |
n_gram_stride |
Step size for n-gram encoding. For AACCGGTT with |
masked_lm |
If not
|
add_noise |
|
return_int |
Whether to return integer encoding or one-hot encoding. |
reshape_xy |
Can be a list of functions to apply to input and/or target. List elements (containing the reshape functions)
must be called x for input or y for target and each have arguments called x and y. For example:
|
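A list of the shape described for reshape_xy might look like the sketch below. Only the element names x and y and the argument names x and y come from the description. The function bodies, and the assumption that the input is a three-dimensional array (batch, maxlen, vocabulary size), are illustrative choices made for this sketch.

```r
# Illustrative reshape functions; the bodies are assumptions made for this sketch.
reshape_xy <- list(
  # applied to the input: flatten each sample of a (batch, maxlen, vocab) array
  x = function(x, y) {
    array(x, dim = c(dim(x)[1], prod(dim(x)[-1])))
  },
  # applied to the target: return it unchanged
  y = function(x, y) {
    y
  }
)

# Stand-in batch with the shapes assumed in this sketch
x_batch <- array(0, dim = c(2, 7, 4))
y_batch <- matrix(c(0, 0, 1, 1, 0, 0), nrow = 2)

dim(reshape_xy$x(x_batch, y_batch))
dim(reshape_xy$y(x_batch, y_batch))
```

Because both functions receive x and y, a reshape of the input can depend on the target and vice versa.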
A generator function.
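In R, a generator of this kind is typically a closure that keeps its position between calls and returns a new batch each time it is invoked. The toy version below only illustrates that pattern. make_toy_generator is hypothetical and does not reflect the package's internals.

```r
# Toy closure-based generator (hypothetical; not the package's implementation).
make_toy_generator <- function(data, batch_size) {
  position <- 0   # state kept between calls
  function() {
    # indices of the next batch, wrapping around at the end of the data
    idx <- (position + seq_len(batch_size) - 1) %% length(data) + 1
    position <<- (position + batch_size) %% length(data)
    data[idx]
  }
}

toy_gen <- make_toy_generator(data = letters[1:5], batch_size = 2)
toy_gen()  # first batch
toy_gen()  # next batch, continuing where the previous call stopped
```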
# create dummy fasta files
path_input_1 <- tempfile()
dir.create(path_input_1)
create_dummy_data(file_path = path_input_1,
                  num_files = 2,
                  seq_length = 7,
                  num_seq = 1,
                  vocabulary = c("a", "c", "g", "t"))

# create generator; target matrix has 3 columns with ones in column 2
gen <- generator_fasta_label_folder(path_corpus = path_input_1, batch_size = 2,
                                    num_targets = 3, ones_column = 2, maxlen = 7)

# draw one batch
z <- gen()
dim(z[[1]])  # dimensions of the encoded input
z[[2]]       # target matrix
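Going by the descriptions of num_targets and ones_column, the target for the call above would presumably be a matrix with batch_size rows and num_targets columns that holds ones in column ones_column. The base R sketch below builds such a matrix by hand. It is an assumption about the layout, not output copied from the function.

```r
# Hand-built target matrix mirroring the arguments of the example call.
# The layout (rows = samples, columns = classes) is an assumption.
batch_size <- 2
num_targets <- 3
ones_column <- 2

target <- matrix(0, nrow = batch_size, ncol = num_targets)
target[, ones_column] <- 1
target
```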