get_start_ind: Computes start position of samples
In GenomeNet/deepG: Deep Learning for Genome Sequence Data

get_start_ind

R Documentation

Computes start position of samples

Description

Helper function for data generators. Computes the start positions in a sequence from which samples can be extracted, given `maxlen`, the step size, and the constraint on ambiguous nucleotides.

Usage

get_start_ind(
  seq_vector,
  length_vector,
  maxlen,
  step,
  train_mode = "label",
  discard_amb_nuc = FALSE,
  vocabulary = c("A", "C", "G", "T")
)

Arguments

`seq_vector`	Vector of character sequences.
`length_vector`	Lengths of the sequences in `seq_vector`.
`maxlen`	Length of one predictor sequence.
`step`	Distance between samples from one entry in `seq_vector`.
`train_mode`	Either `"lm"` for language model or `"label"` for label classification.
`discard_amb_nuc`	Whether to discard all samples that contain characters outside `vocabulary`.
`vocabulary`	Vector of allowed characters. Characters outside `vocabulary` get encoded as specified in `ambiguous_nuc`, which is not an argument of this function.

Value

A numeric vector of start positions.

Examples

# One sequence with a stretch of ambiguous nucleotides (N) in the middle
seq_vector <- c("AAACCCNNNGGGTTT")
get_start_ind(
  seq_vector = seq_vector,
  length_vector = nchar(seq_vector),
  maxlen = 4,
  step = 2,
  train_mode = "label",
  discard_amb_nuc = TRUE,
  vocabulary = c("A", "C", "G", "T"))
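
To make the roles of `maxlen`, `step` and `discard_amb_nuc` more concrete, the following base R sketch computes window start positions for a single sequence. The function `sketch_start_ind` is a hypothetical illustration, not deepG's implementation: it assumes a fixed grid of candidate starts (1, 1 + `step`, ...), handles only one sequence, and ignores `train_mode`. The values returned by `get_start_ind` may therefore differ.

```r
# Hypothetical helper for illustration only; not part of deepG.
sketch_start_ind <- function(x, maxlen, step,
                             discard_amb_nuc = FALSE,
                             vocabulary = c("A", "C", "G", "T")) {
  n <- nchar(x)
  # No window of length maxlen fits into the sequence.
  if (n < maxlen) return(numeric(0))
  # Candidate starts on a fixed grid; the last window must end at or before n.
  starts <- seq(1, n - maxlen + 1, by = step)
  if (!discard_amb_nuc) return(starts)
  chars <- strsplit(x, "")[[1]]
  # Keep a start only if every character in its window is in the vocabulary.
  valid <- vapply(
    starts,
    function(s) all(chars[s:(s + maxlen - 1)] %in% vocabulary),
    logical(1)
  )
  starts[valid]
}

# Windows starting at 5, 7 and 9 overlap the N stretch and are dropped.
sketch_start_ind("AAACCCNNNGGGTTT", maxlen = 4, step = 2,
                 discard_amb_nuc = TRUE)
# returns 1, 3, 11 for this sketch
```

With `discard_amb_nuc = FALSE` the same call keeps all six candidate starts (1, 3, 5, 7, 9, 11), which shows how the ambiguous-nucleotide constraint reduces the number of usable samples.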

GenomeNet/deepG documentation built on Jan. 25, 2025, 12:05 a.m.