seq_encoding_lm: Encodes integer sequence for language model
In GenomeNet/deepG: Deep Learning for Genome Sequence Data

seq_encoding_lm

R Documentation

Description

Helper function for `generator_fasta_lm`. Encodes an integer sequence into a list of input and target tensors, with the layout determined by the `output_format` argument.

Usage

seq_encoding_lm(
  sequence = NULL,
  maxlen,
  vocabulary,
  start_ind,
  ambiguous_nuc = "zero",
  nuc_dist = NULL,
  quality_vector = NULL,
  return_int = FALSE,
  target_len = 1,
  use_coverage = FALSE,
  max_cov = NULL,
  cov_vector = NULL,
  n_gram = NULL,
  n_gram_stride = 1,
  output_format = "target_right",
  char_sequence = NULL,
  adjust_start_ind = FALSE,
  tokenizer = NULL
)

Arguments

`sequence`	Sequence of integers.
`maxlen`	Length of predictor sequence.
`vocabulary`	Vector of allowed characters. Characters outside vocabulary get encoded as specified in `ambiguous_nuc`.
`start_ind`	Start positions of samples in `sequence`.
`ambiguous_nuc`	How to handle nucleotides outside the vocabulary: either `"zero"`, `"empirical"` or `"equal"`. See `train_model`. Note that the `"discard"` option is not available for this function.
`nuc_dist`	Nucleotide distribution.
`quality_vector`	Vector of quality probabilities.
`return_int`	Whether to return integer encoding or one-hot encoding.
`target_len`	Number of nucleotides to predict at once for language model.
`use_coverage`	Integer or `NULL`. If not `NULL`, use coverage as encoding rather than one-hot encoding and normalize. Coverage information must be contained in fasta header: there must be a string `"cov_n"` in the header, where `n` is some integer.
`max_cov`	Biggest coverage value. Only applies if `use_coverage = TRUE`.
`cov_vector`	Vector of coverage values associated to the input.
`n_gram`	Integer. Encode the target not nucleotide-wise but by combining n nucleotides at once. For example, for `n = 2`: `"AA" -> (1, 0, ..., 0)`, `"AC" -> (0, 1, 0, ..., 0)`, `"TT" -> (0, ..., 0, 1)`, where the one-hot vectors have length `length(vocabulary)^n`.
`n_gram_stride`	Step size for n-gram encoding. For AACCGGTT with `n_gram = 4` and `n_gram_stride = 2`, the generator encodes `(AACC), (CCGG), (GGTT)`; for `n_gram_stride = 4`, the generator encodes `(AACC), (GGTT)`.
`output_format`	Determines the shape of the output tensor for the language model. Either `"target_right"`, `"target_middle_lstm"`, `"target_middle_cnn"` or `"wavenet"`. Assume a sequence `"AACCGTA"`. The outputs correspond as follows:
	`"target_right": X = "AACCGT", Y = "A"`
	`"target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C"` (note the reversed order of X_2)
	`"target_middle_cnn": X = "AACGTA", Y = "C"`
	`"wavenet": X = "AACCGT", Y = "ACCGTA"`
`char_sequence`	A character string.
`adjust_start_ind`	Whether to shift values in `start_ind` to start at 1: for example (5,11,25) becomes (1,7,21).
`tokenizer`	A keras tokenizer.
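
The four `output_format` layouts can be reproduced on a plain character string with base R. The sketch below is only an illustration of the argument description: `split_window` is a hypothetical helper, not part of deepG, it handles a single-character target only, and it assumes that the "middle" formats place the target at the central position of the window.

```r
# Hypothetical helper (not a deepG function): splits one window into
# input X and target Y as described for `output_format`.
split_window <- function(window, output_format = "target_right") {
  chars <- strsplit(window, "")[[1]]
  n <- length(chars)
  mid <- ceiling(n / 2)  # assumed target position for the "middle" formats
  join <- function(v) paste(v, collapse = "")
  switch(output_format,
    target_right = list(X = join(chars[-n]), Y = chars[n]),
    target_middle_lstm = list(
      X = list(X_1 = join(chars[seq_len(mid - 1)]),
               X_2 = join(rev(chars[(mid + 1):n]))),  # right part, reversed
      Y = chars[mid]),
    target_middle_cnn = list(X = join(chars[-mid]), Y = chars[mid]),
    wavenet = list(X = join(chars[-n]), Y = join(chars[-1])),
    stop("unknown output_format: ", output_format)
  )
}

split_window("AACCGTA", "target_right")        # X = "AACCGT", Y = "A"
split_window("AACCGTA", "target_middle_lstm")  # X_1 = "AAC", X_2 = "ATG", Y = "C"
split_window("AACCGTA", "target_middle_cnn")   # X = "AACGTA", Y = "C"
split_window("AACCGTA", "wavenet")             # X = "AACCGT", Y = "ACCGTA"
```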

Value

A list of 2 tensors.
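
To make the shape of the two returned tensors concrete, the following base R sketch builds comparable arrays for `output_format = "target_right"` with `target_len = 1`. It is not deepG's implementation: `one_hot_windows` is a hypothetical helper, and it assumes that integers `1..length(vocabulary)` index the vocabulary, that any other integer is treated as ambiguous, and that `"equal"` assigns `1 / length(vocabulary)` to every position (see `train_model` for the authoritative description).

```r
# Sketch only: mimics the indexing used in the examples (x[i, , ], y[i, ])
# with base R arrays; not deepG's internal code.
one_hot_windows <- function(int_seq, start_ind, maxlen, voc_len,
                            ambiguous = c("zero", "equal")) {
  ambiguous <- match.arg(ambiguous)
  encode <- function(tokens) {
    m <- matrix(0, nrow = length(tokens), ncol = voc_len)
    known <- tokens >= 1 & tokens <= voc_len      # assumed in-vocabulary range
    m[cbind(which(known), tokens[known])] <- 1    # one-hot rows
    if (ambiguous == "equal") m[!known, ] <- 1 / voc_len
    m
  }
  x <- array(0, dim = c(length(start_ind), maxlen, voc_len))
  y <- matrix(0, nrow = length(start_ind), ncol = voc_len)
  for (i in seq_along(start_ind)) {
    s <- start_ind[i]
    x[i, , ] <- encode(int_seq[s:(s + maxlen - 1)])  # predictor window
    y[i, ] <- encode(int_seq[s + maxlen])            # next position as target
  }
  list(x, y)
}

z <- one_hot_windows(c(1,0,5,1,3,4,3,1,4,1,2), start_ind = c(1, 3),
                     maxlen = 5, voc_len = 4, ambiguous = "equal")
dim(z[[1]])  # 2 5 4: samples, maxlen, vocabulary size
dim(z[[2]])  # 2 4: samples, vocabulary size
```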

Examples


# use integer sequence as input
library(deepG)

z <- seq_encoding_lm(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "equal",
                     target_len = 1,
                     output_format = "target_right")

x <- z[[1]]  # input tensor
y <- z[[2]]  # target tensor

x[1,,] # 1,0,5,1,3
y[1,] # 4

x[2,,] # 5,1,3,4,3
y[2,] # 1

# use character string as input
z <- seq_encoding_lm(sequence = NULL,
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "zero",
                     target_len = 1,
                     output_format = "target_right",
                     char_sequence = "ACTaaTNTNaZ")

x <- z[[1]]
y <- z[[2]]

x[1,,] # actaa
y[1,] # t

x[2,,] # taatn
y[2,] # t
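
The `n_gram` and `n_gram_stride` arguments can also be illustrated without deepG. The base R sketch below reproduces the windows and one-hot positions given in the argument descriptions. Both helpers are hypothetical, and the index ordering (first nucleotide most significant) is an assumption chosen to match the documented `"AA"`, `"AC"` and `"TT"` cases.

```r
# Hypothetical helpers (not deepG functions) for the n-gram description.

# Cut a string into n-grams, moving the window by `n_gram_stride`.
ngram_windows <- function(char_seq, n_gram, n_gram_stride = 1) {
  starts <- seq(1, nchar(char_seq) - n_gram + 1, by = n_gram_stride)
  substring(char_seq, starts, starts + n_gram - 1)
}

# Position of the 1 in a one-hot vector of length length(vocabulary)^n.
ngram_index <- function(ngram, vocabulary) {
  pos <- match(strsplit(ngram, "")[[1]], vocabulary)  # 1-based positions
  n <- length(pos)
  sum((pos - 1) * length(vocabulary)^((n - 1):0)) + 1
}

ngram_windows("AACCGGTT", n_gram = 4, n_gram_stride = 2)  # "AACC" "CCGG" "GGTT"
ngram_windows("AACCGGTT", n_gram = 4, n_gram_stride = 4)  # "AACC" "GGTT"

voc <- c("A", "C", "G", "T")
ngram_index("AA", voc)  # 1
ngram_index("AC", voc)  # 2
ngram_index("TT", voc)  # 16
```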

GenomeNet/deepG documentation built on Jan. 25, 2025, 12:05 a.m.