fastaFileGenerator: Custom generator for fasta/fastq files


View source: R/generators.R

Description

fastaFileGenerator iterates over a folder containing .fasta/.fastq files and produces one-hot encodings of the predictor sequences and target variables.

Usage

fastaFileGenerator(
  corpus.dir,
  format = "fasta",
  batch.size = 256,
  maxlen = 250,
  max_iter = 2000,
  vocabulary = c("a", "c", "g", "t"),
  verbose = FALSE,
  randomFiles = FALSE,
  step = 1,
  showWarnings = FALSE,
  seed = 1234,
  shuffleFastaEntries = FALSE,
  numberOfFiles = NULL,
  fileLog = NULL,
  reverseComplements = FALSE
)

Arguments

corpus.dir

Input directory where the .fasta or .fastq files are located, or path to a single file ending with .fasta or .fastq (as specified in the format argument).

format

File format, either fasta or fastq.

batch.size

Number of samples in one batch.

maxlen

Length of predictor sequence.

max_iter

Stop after max_iter iterations have failed to produce a new batch.

vocabulary

Vector of allowed characters; characters outside the vocabulary are encoded as a zero vector.
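The encoding described for vocabulary can be illustrated with a minimal sketch. The helper one_hot_encode is hypothetical and not part of deepG, and lowercasing the input is an assumption made here so that the sequence matches the default vocabulary.

```r
# Hypothetical helper, not part of deepG: one-hot encode a single sequence.
one_hot_encode <- function(sequence, vocabulary = c("a", "c", "g", "t")) {
  # Assumption: input is lowercased to match the default vocabulary.
  chars <- strsplit(tolower(sequence), "")[[1]]
  # match() returns NA for characters outside the vocabulary.
  idx <- match(chars, vocabulary)
  encoded <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                    dimnames = list(chars, vocabulary))
  # Set a 1 only for known characters; unknown rows stay all zero.
  known <- which(!is.na(idx))
  encoded[cbind(known, idx[known])] <- 1
  encoded
}

# "n" is outside the vocabulary, so the fourth row is a zero vector.
x <- one_hot_encode("acgnt")
print(x)
```

Each row corresponds to one position in the sequence and each column to one vocabulary character.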

verbose

Logical, whether to show messages.

randomFiles

Logical, whether to go through the files randomly or sequentially.

step

How often to take a sample, given as the step size between the start positions of consecutive samples.
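One plausible reading of step and maxlen is sketched below. It is an illustration and not deepG code; it assumes that step is the distance between the start positions of consecutive samples and that each sample consists of maxlen predictor characters followed by one target character.

```r
# Hypothetical illustration, not deepG code.
sequence_length <- 20  # length of the concatenated sequence (made up)
maxlen <- 5            # length of each predictor window
step <- 3              # assumed distance between window starts

# Each sample needs maxlen predictor characters plus one target character,
# so the last possible start position is sequence_length - maxlen.
starts <- seq(1, sequence_length - maxlen, by = step)

# Positions covered by each sample under these assumptions.
windows <- data.frame(
  first  = starts,
  last   = starts + maxlen - 1,
  target = starts + maxlen
)
print(windows)
```

With step = 1 every position starts a new sample; larger values produce fewer, less overlapping samples.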

showWarnings

Logical, whether to give a warning if a character outside the vocabulary appears.

seed

Seed passed to the set.seed function, for reproducible results when using randomFiles or shuffleFastaEntries.

shuffleFastaEntries

Logical, whether to shuffle the entries in every fasta file before concatenating them into one sequence.

numberOfFiles

Use only the specified number of files; ignored if greater than the number of files in corpus.dir.

fileLog

Write the names of the files to a csv file if a path is specified.

reverseComplements

Logical, whether half of each batch contains sequences and the other half their reverse complements. The reverse complement is obtained by reversing the order of the sequence and swapping A/T and C/G. The batch.size argument has to be even; otherwise 1 is added to batch.size.
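The reverse complement rule and the batch.size adjustment can be sketched in base R as follows. The function reverse_complement is a hypothetical helper written for this illustration and is not taken from the deepG source.

```r
# Hypothetical helper, not part of deepG.
reverse_complement <- function(sequence) {
  # Swap A/T and C/G (both cases), leaving other characters unchanged.
  complemented <- chartr("acgtACGT", "tgcaTGCA", sequence)
  # Reverse the order of the characters.
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

rc <- reverse_complement("aacg")
print(rc)

# An odd batch.size is raised by 1 so the batch can be split into two halves.
batch.size <- 255
if (batch.size %% 2 != 0) batch.size <- batch.size + 1
print(batch.size)
```

Applying reverse_complement twice returns the original sequence, which is a quick sanity check for the mapping.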

Value

A list of length 2. First element is a 3-dimensional tensor with dimensions (batch.size, maxlen, length(vocabulary)), encoding the predictor sequences. Second element is a matrix with dimensions (batch.size, length(vocabulary)), encoding the targets.
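The shapes described above can be reproduced with a small base R sketch. It builds the two elements by hand rather than calling fastaFileGenerator, uses made-up values for batch.size and maxlen, and fills only the first sample with an arbitrary predictor and target.

```r
# Hypothetical illustration of the output shapes, not deepG code.
batch.size <- 4
maxlen <- 3
vocabulary <- c("a", "c", "g", "t")

# Predictor tensor: (batch.size, maxlen, length(vocabulary)).
x <- array(0, dim = c(batch.size, maxlen, length(vocabulary)))
# Target matrix: (batch.size, length(vocabulary)).
y <- matrix(0, nrow = batch.size, ncol = length(vocabulary))

# Fill sample 1 with predictor "acg" and target "t" (values chosen arbitrarily).
predictor <- c("a", "c", "g")
x[cbind(1, seq_len(maxlen), match(predictor, vocabulary))] <- 1
y[1, match("t", vocabulary)] <- 1

batch <- list(x, y)
print(dim(batch[[1]]))
print(dim(batch[[2]]))
```

A model consuming such a batch would take the first element as input and the second as the label.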


hiddengenome/deepG documentation built on April 16, 2020, 1:38 a.m.