labelByFolderGenerator: Custom generator for fasta files


View source: R/generators.R

Description

labelByFolderGenerator iterates over a folder containing .fasta files and produces one-hot encodings of the predictor sequences together with the target variables. All files in corpus.dir should belong to one class.

Usage

labelByFolderGenerator(
  corpus.dir,
  format = "fasta",
  batch.size = 256,
  maxlen = 250,
  max_iter = 10000,
  vocabulary = c("a", "c", "g", "t"),
  verbose = FALSE,
  randomFiles = FALSE,
  step = 1,
  showWarnings = FALSE,
  seed = 1234,
  shuffleFastaEntries = FALSE,
  numberOfFiles = NULL,
  fileLog = NULL,
  reverseComplements = TRUE,
  numTargets,
  onesColumn
)

Arguments

corpus.dir

Input directory containing the .fasta files, or path to a single file ending in .fasta or .fastq (as specified by the format argument).

format

File format, either fasta or fastq.

batch.size

Number of samples in one batch.

maxlen

Length of predictor sequence.

max_iter

Stop once max_iter iterations have failed to produce a new batch.

vocabulary

Vector of allowed characters; characters outside the vocabulary are encoded as a zero vector.
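
To illustrate the encoding described for vocabulary, here is a minimal base R sketch. The helper one_hot_encode is hypothetical and not part of the package, and lower-casing the input is an assumption made for the example; the generator's internal implementation may differ.

```r
# Hypothetical helper (not part of the package): one-hot encode one sequence.
one_hot_encode <- function(sequence, vocabulary = c("a", "c", "g", "t")) {
  # Assumption: input is lower-cased before matching against the vocabulary.
  chars <- strsplit(tolower(sequence), "")[[1]]
  # match() returns NA for characters outside the vocabulary.
  idx <- match(chars, vocabulary)
  encoded <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                    dimnames = list(NULL, vocabulary))
  known <- !is.na(idx)
  # Set a single 1 per known character; unknown characters keep an all-zero row.
  encoded[cbind(which(known), idx[known])] <- 1
  encoded
}

x <- one_hot_encode("acgn")
print(x)
```

In this sketch the row for "n" stays all zeros because "n" is not in the vocabulary.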

verbose

Logical, whether to show messages.

randomFiles

Logical, whether to go through the files in random or sequential order.

step

How often to take a sample.
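
One plausible reading of step, stated here as an assumption rather than taken from the package source, is that it is the offset between the start positions of consecutive samples of length maxlen within a sequence. The helper window_starts below is hypothetical and only sketches that interpretation.

```r
# Hypothetical helper (not part of the package): start positions of
# length-maxlen windows, taking one window every `step` characters.
window_starts <- function(seq_length, maxlen, step = 1) {
  # A sequence shorter than maxlen yields no windows.
  if (seq_length < maxlen) return(integer(0))
  seq(1, seq_length - maxlen + 1, by = step)
}

sequence <- "abcdefghij"
starts <- window_starts(nchar(sequence), maxlen = 4, step = 3)
# Extract the windows beginning at each start position.
windows <- substring(sequence, starts, starts + 4 - 1)
print(windows)
```

Under this reading, step = 1 produces heavily overlapping samples, while a larger step produces fewer samples per sequence.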

showWarnings

Logical, whether to give a warning if a character outside the vocabulary appears.

seed

Seed passed to set.seed, for reproducible results when using randomFiles or shuffleFastaEntries.

shuffleFastaEntries

Logical, whether to shuffle fasta entries.

numberOfFiles

Use only the specified number of files; ignored if greater than the number of files in corpus.dir.

fileLog

If a path is specified, write the names of the files to a csv file at that path.

reverseComplements

Logical; if TRUE, half of each batch contains sequences and the other half their reverse complements. The reverse complement is obtained by reversing the sequence and swapping A/T and C/G. batch.size has to be even; otherwise 1 is added to batch.size.
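
The reverse complement and the batch size adjustment described for reverseComplements can be sketched in base R as follows. The function reverse_complement is a hypothetical helper written for illustration, not the package's own implementation.

```r
# Hypothetical helper (not part of the package): reverse complement of a sequence.
reverse_complement <- function(sequence) {
  # chartr() swaps A/T and C/G in both cases; other characters are unchanged.
  complemented <- chartr("acgtACGT", "tgcaTGCA", sequence)
  # Reverse the order of the characters.
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

rc <- reverse_complement("aacg")
print(rc)

# An odd batch size is rounded up to the next even number,
# so that the batch can be split into two equal halves.
batch.size <- 255
even_batch <- batch.size + batch.size %% 2
print(even_batch)
```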

numTargets

Number of columns of target matrix.

onesColumn

Index of the column of the target matrix that contains ones.
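
Because all files in corpus.dir belong to one class, every row of the target matrix carries the same label. A minimal sketch of such a matrix, assuming it is all zeros except for ones in column onesColumn, could look like this (make_targets is a hypothetical helper, not a package function):

```r
# Hypothetical helper (not part of the package): build a target matrix
# with numTargets columns and ones only in column onesColumn.
make_targets <- function(batch.size, numTargets, onesColumn) {
  stopifnot(onesColumn >= 1, onesColumn <= numTargets)
  y <- matrix(0, nrow = batch.size, ncol = numTargets)
  y[, onesColumn] <- 1
  y
}

y <- make_targets(batch.size = 4, numTargets = 3, onesColumn = 2)
print(y)
```

With one generator per class folder, each generator would use the same numTargets and a different onesColumn.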

Value

A list of length 2. The first element is a 3-dimensional tensor with dimensions (batch.size, maxlen, length(vocabulary)), encoding the predictor sequences. The second element is a matrix with dimensions (batch.size, numTargets), encoding the targets.


hiddengenome/altum documentation built on April 22, 2020, 9:33 p.m.