altum: deepG

fastaFileGenerator: Iterates over a folder containing .fasta/.fastq files and produces one-hot encodings of predictor sequences and target variables.

fastaFileGenerator(
  corpus.dir,
  format = "fasta",
  batch.size = 256,
  maxlen = 250,
  max_iter = 2000,
  vocabulary = c("a", "c", "g", "t"),
  verbose = FALSE,
  randomFiles = FALSE,
  step = 1,
  showWarnings = FALSE,
  seed = 1234,
  shuffleFastaEntries = FALSE,
  numberOfFiles = NULL,
  fileLog = NULL,
  reverseComplements = FALSE
)

`corpus.dir`	Input directory where the .fasta/.fastq files are located, or path to a single file ending with .fasta or .fastq (as specified in the `format` argument).
`format`	File format, either fasta or fastq.
`batch.size`	Number of samples in one batch.
`maxlen`	Length of predictor sequence.
`max_iter`	Stop after `max_iter` consecutive iterations have failed to produce a new batch.
`vocabulary`	Vector of allowed characters; characters outside the vocabulary are encoded as zero vectors.
`verbose`	Logical, whether to show messages.
`randomFiles`	Logical, whether to go through the files in random order or sequentially.
`step`	How often to take a sample, i.e. the number of positions between the starts of consecutive samples.
`showWarnings`	Logical, whether to give a warning if a character outside the vocabulary appears.
`seed`	Seed passed to `set.seed`, for reproducible results when using `randomFiles` or `shuffleFastaEntries`.
`shuffleFastaEntries`	Logical, whether to shuffle the entries in every fasta file before concatenating them into one sequence.
`numberOfFiles`	Use only the specified number of files; ignored if greater than the number of files in `corpus.dir`.
`fileLog`	Path to a csv file; if specified, the names of the files used are written to it.
`reverseComplements`	Logical, whether half of each batch contains sequences and the other half their reverse complements. The reverse complement is obtained by reversing the order of the sequence and swapping A/T and C/G. `batch.size` has to be even; otherwise 1 is added to `batch.size`.
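
The reverse-complement rule and the even `batch.size` adjustment described for `reverseComplements` can be sketched in base R. This is only an illustration of the stated rules, not the package's implementation; `reverse_complement` and `effective_batch_size` are hypothetical helper names, and lowercase input matching the default vocabulary is assumed.

```r
# Hypothetical helper: reverse complement of a lowercase nucleotide string.
reverse_complement <- function(seq) {
  # chartr() swaps a<->t and c<->g; any other character is left unchanged
  complemented <- chartr("acgt", "tgca", seq)
  # reverse the order of the characters
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Hypothetical helper: an odd batch size is increased by 1, as documented
effective_batch_size <- function(batch.size) {
  if (batch.size %% 2 != 0) batch.size + 1 else batch.size
}

reverse_complement("aacg")   # "cgtt"
effective_batch_size(5)      # 6
```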

A list of length 2. First element is a 3-dimensional tensor with dimensions (batch.size, maxlen, length(vocabulary)), encoding the predictor sequences. Second element is a matrix with dimensions (batch.size, length(vocabulary)), encoding the targets.
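
To make the shapes of the returned list concrete, the following base R sketch builds a batch by hand from a short toy sequence. It does not call the package. Two things are assumptions rather than documented behaviour: that `step` is the distance between the start positions of consecutive samples, and that each target is the character immediately following its predictor window. `one_hot` is a hypothetical helper.

```r
vocabulary <- c("a", "c", "g", "t")
maxlen <- 3
step <- 2

# Hypothetical helper: one row per character, one column per vocabulary entry.
# Characters outside the vocabulary give an all-zero row.
one_hot <- function(chars, vocabulary) {
  idx <- match(chars, vocabulary)   # NA for out-of-vocabulary characters
  m <- matrix(0, nrow = length(chars), ncol = length(vocabulary))
  ok <- which(!is.na(idx))
  m[cbind(ok, idx[ok])] <- 1
  m
}

chars <- strsplit("acgtnac", "")[[1]]   # "n" is outside the vocabulary

# Assumption: window starts are `step` positions apart, and every window
# must leave one character after it to serve as the target
starts <- seq(1, length(chars) - maxlen, by = step)
batch.size <- length(starts)

x <- array(0, dim = c(batch.size, maxlen, length(vocabulary)))
y <- matrix(0, nrow = batch.size, ncol = length(vocabulary))
for (i in seq_along(starts)) {
  window <- chars[starts[i]:(starts[i] + maxlen - 1)]
  x[i, , ] <- one_hot(window, vocabulary)
  # Assumption: the target is the character right after the window
  y[i, ] <- one_hot(chars[starts[i] + maxlen], vocabulary)
}

batch <- list(x, y)
dim(batch[[1]])   # 2 3 4: (batch.size, maxlen, length(vocabulary))
dim(batch[[2]])   # 2 4:   (batch.size, length(vocabulary))
```

In this toy case the second window is "gtn", so the third position of that sample is a zero vector, matching the description of `vocabulary`.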

hiddengenome/altum documentation built on April 22, 2020, 9:33 p.m.