fastaLabelGenerator: Custom generator for fasta files and label targets


View source: R/generators.R

Description

fastaLabelGenerator iterates over a folder containing .fasta files and produces one-hot encodings of the predictor sequences and of the target variables. Targets are read from the fasta headers.

Usage

fastaLabelGenerator(
  corpus.dir,
  format = "fasta",
  batch.size = 256,
  maxlen = 250,
  max_iter = 10000,
  vocabulary = c("a", "c", "g", "t"),
  verbose = FALSE,
  randomFiles = FALSE,
  step = 1,
  showWarnings = FALSE,
  seed = 1234,
  shuffleFastaEntries = FALSE,
  numberOfFiles = NULL,
  fileLog = NULL,
  labelVocabulary = c("x", "y", "z"),
  reverseComplements = TRUE
)

Arguments

corpus.dir

Input directory containing the .fasta files, or path to a single file ending in .fasta or .fastq (matching the format argument).

format

File format, either fasta or fastq.

batch.size

Number of samples in one batch.

maxlen

Length of predictor sequence.

max_iter

Stop after max_iter consecutive iterations have failed to produce a new batch.

vocabulary

Vector of allowed characters. Characters outside the vocabulary are encoded as a 0-vector.
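
The encoding described for vocabulary can be illustrated with a small base R sketch. The helper one_hot_sequence is hypothetical and not part of the package, and the conversion to lower case is an assumption made for this example only.

```r
# Hypothetical helper (not part of the package): one-hot encode one sequence.
one_hot_sequence <- function(sequence, vocabulary = c("a", "c", "g", "t")) {
  # Assumption for this sketch: input is lower-cased before matching.
  chars <- strsplit(tolower(sequence), "")[[1]]
  idx <- match(chars, vocabulary)  # NA for characters outside the vocabulary
  encoded <- matrix(0L, nrow = length(chars), ncol = length(vocabulary),
                    dimnames = list(NULL, vocabulary))
  inside <- !is.na(idx)
  # Set a single 1 per in-vocabulary position; other rows stay all zero.
  encoded[cbind(which(inside), idx[inside])] <- 1L
  encoded
}

x <- one_hot_sequence("acgtn")
print(x)
```

Each row corresponds to one position and each column to one vocabulary character; the row for "n" contains only zeros.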

verbose

Logical, whether to show messages.

randomFiles

Logical, whether to go through the files in random order or sequentially.

step

How often to take a sample, i.e. the distance in characters between the start positions of consecutive samples.
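
One plausible reading of step is a stride between the start positions of consecutive samples of length maxlen. The sketch below illustrates that reading only. The helper window_starts is hypothetical, and the handling of the sequence end is an assumption that may differ from the package.

```r
# Hypothetical helper (not part of the package): start positions of samples
# when a window of length maxlen is moved along a sequence in strides of step.
window_starts <- function(seq_length, maxlen, step = 1) {
  # Assumption: sequences shorter than maxlen yield no samples.
  if (seq_length < maxlen) return(integer(0))
  seq(1, seq_length - maxlen + 1, by = step)
}

starts <- window_starts(seq_length = 10, maxlen = 4, step = 3)
print(starts)
```

Under this reading, a larger step produces fewer samples with less overlap between them.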

showWarnings

Logical, give warning if character outside vocabulary appears.

seed

Seed passed to set.seed, for reproducible results when using randomFiles or shuffleFastaEntries.

shuffleFastaEntries

Logical, whether to shuffle fasta entries.

numberOfFiles

Use only the specified number of files. Ignored if greater than the number of files in corpus.dir.

fileLog

Path to a csv file. If specified, the names of the files used are written to it.

labelVocabulary

Character vector of possible targets. Targets outside labelVocabulary are discarded.
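
The filtering and encoding of targets can be sketched in base R as follows. The helper encode_labels is hypothetical and not part of the package. It assumes the labels have already been extracted from the fasta headers.

```r
# Hypothetical helper (not part of the package): drop labels outside
# labelVocabulary and one-hot encode the remaining ones.
encode_labels <- function(labels, labelVocabulary = c("x", "y", "z")) {
  keep <- labels %in% labelVocabulary
  idx <- match(labels[keep], labelVocabulary)
  targets <- matrix(0L, nrow = sum(keep), ncol = length(labelVocabulary),
                    dimnames = list(NULL, labelVocabulary))
  # One row per kept label, with a single 1 in the matching column.
  targets[cbind(seq_along(idx), idx)] <- 1L
  list(keep = keep, targets = targets)
}

res <- encode_labels(c("x", "w", "z"))
print(res$targets)
```

Here the label "w" is not in labelVocabulary, so only two rows remain in the target matrix.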

reverseComplements

Logical. If TRUE, half of each batch contains sequences and the other half their reverse complements. The reverse complement is obtained by reversing the order of the sequence and swapping A with T and C with G. batch.size has to be even; otherwise 1 is added to batch.size.
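
The reverse complement as defined above can be computed in base R with chartr and rev. The helper reverse_complement is a hypothetical illustration, not the package's internal implementation.

```r
# Hypothetical helper (not part of the package): reverse complement of a
# DNA sequence given as a single character string.
reverse_complement <- function(sequence) {
  # Swap a/t and c/g (both cases); other characters are left unchanged.
  complemented <- chartr("acgtACGT", "tgcaTGCA", sequence)
  # Reverse the order of the characters.
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

rc <- reverse_complement("aacg")
print(rc)
```

Applying the function twice returns the original sequence.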

Value

A list of length 2. The first element is a 3-dimensional tensor with dimensions (batch.size, maxlen, length(vocabulary)), encoding the predictor sequences. The second element is a matrix with dimensions (batch.size, length(labelVocabulary)), encoding the targets.


hiddengenome/altum documentation built on April 22, 2020, 9:33 p.m.