get_start_ind: Computes start position of samples

View source: R/preprocess.R

get_start_ind    R Documentation

Computes start position of samples

Description

Helper function for data generators. Computes the start positions in a sequence at which samples can be extracted, given maxlen, the step size and the constraints on ambiguous nucleotides.
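
The idea can be sketched in base R for a single sequence. This is a simplified illustration, not the package's implementation: start_positions_sketch is a hypothetical helper, it assumes a sample is a window of exactly maxlen characters, and it places windows on a fixed grid starting at position 1. get_start_ind may place windows differently, so its output need not match this sketch.

```r
# Hypothetical helper (not part of deepG): candidate start positions for one
# sequence, optionally dropping windows with out-of-vocabulary characters.
start_positions_sketch <- function(x, maxlen, step,
                                   discard_amb_nuc = FALSE,
                                   vocabulary = c("A", "C", "G", "T")) {
  n <- nchar(x)
  if (n < maxlen) return(numeric(0))

  # Every step-th position that still leaves room for a full window
  starts <- seq(1, n - maxlen + 1, by = step)

  if (discard_amb_nuc) {
    chars <- strsplit(x, "")[[1]]
    is_amb <- !(chars %in% vocabulary)
    # Keep a start only if its whole window is free of ambiguous characters
    keep <- vapply(
      starts,
      function(s) !any(is_amb[s:(s + maxlen - 1)]),
      logical(1)
    )
    starts <- starts[keep]
  }
  starts
}

sketch_result <- start_positions_sketch(
  "AAACCCNNNGGGTTT", maxlen = 4, step = 2, discard_amb_nuc = TRUE
)
sketch_result  # 1 3 11
```

Windows starting at 5, 7 and 9 are dropped because each of them overlaps the N stretch at positions 7 to 9.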

Usage

get_start_ind(
  seq_vector,
  length_vector,
  maxlen,
  step,
  train_mode = "label",
  discard_amb_nuc = FALSE,
  vocabulary = c("A", "C", "G", "T")
)

Arguments

seq_vector

Vector of character sequences.

length_vector

Length of sequences in seq_vector.

maxlen

Length of one predictor sequence.

step

Distance between samples from one entry in seq_vector.
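
To illustrate how step interacts with maxlen, the following base R lines compare an overlapping and a non-overlapping layout. They assume that the first sample starts at position 1 and that each sample spans maxlen characters; both are assumptions made for illustration and are not confirmed by this page.

```r
seq_length <- 15  # length of one entry in seq_vector (illustrative value)
maxlen <- 4

# step < maxlen: consecutive samples overlap by maxlen - step characters
overlapping <- seq(1, seq_length - maxlen + 1, by = 2)
overlapping  # 1 3 5 7 9 11

# step == maxlen: consecutive samples do not overlap
disjoint <- seq(1, seq_length - maxlen + 1, by = 4)
disjoint  # 1 5 9
```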

train_mode

Either "lm" for language model or "label" for label classification.

discard_amb_nuc

Whether to discard all samples that contain characters outside vocabulary.

vocabulary

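Vector of allowed characters. Characters outside vocabulary are treated as ambiguous. How they are encoded is specified by ambiguous_nuc, which is an argument of the data generators and not of this function.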

Value

A numeric vector of start positions.

Examples

library(deepG)

# One sequence with a stretch of ambiguous nucleotides (N) in the middle
seq_vector <- c("AAACCCNNNGGGTTT")
get_start_ind(
  seq_vector = seq_vector,
  length_vector = nchar(seq_vector),
  maxlen = 4,
  step = 2,
  train_mode = "label",
  discard_amb_nuc = TRUE,
  vocabulary = c("A", "C", "G", "T")
)

GenomeNet/deepG documentation built on Dec. 24, 2024, 12:11 p.m.