seq_encoding_label: Encodes integer sequence for label classification.
In GenomeNet/deepG: Deep Learning for Genome Sequence Data

seq_encoding_label

R Documentation

Description

Returns the encoding of an integer or character sequence, with one sample of length `maxlen` for each entry of `start_ind`.

Usage

seq_encoding_label(
  sequence = NULL,
  maxlen,
  vocabulary,
  start_ind,
  ambiguous_nuc = "zero",
  nuc_dist = NULL,
  quality_vector = NULL,
  use_coverage = FALSE,
  max_cov = NULL,
  cov_vector = NULL,
  n_gram = NULL,
  n_gram_stride = 1,
  masked_lm = NULL,
  char_sequence = NULL,
  tokenizer = NULL,
  adjust_start_ind = FALSE,
  return_int = FALSE
)

Arguments

`sequence`	Sequence of integers.
`maxlen`	Length of predictor sequence.
`vocabulary`	Vector of allowed characters. Characters outside vocabulary get encoded as specified in `ambiguous_nuc`.
`start_ind`	Start positions of samples in `sequence`.
`ambiguous_nuc`	How to handle nucleotides outside the vocabulary: either `"zero"`, `"empirical"` or `"equal"`. See `train_model`. Note that the `"discard"` option is not available for this function.
`nuc_dist`	Nucleotide distribution.
`quality_vector`	Vector of quality probabilities.
`use_coverage`	Logical. If `TRUE`, use coverage as encoding rather than one-hot encoding and normalize. Coverage information must be contained in the fasta header: there must be a string `"cov_n"` in the header, where `n` is some integer.
`max_cov`	Largest coverage value. Only applies if `use_coverage = TRUE`.
`cov_vector`	Vector of coverage values associated with the input.
`n_gram`	Integer. Encode the target not nucleotide by nucleotide but by combining n nucleotides at once. For example, for `n = 2`: `"AA" -> (1, 0,..., 0)`, `"AC" -> (0, 1, 0,..., 0)`, `"TT" -> (0,..., 0, 1)`, where the one-hot vectors have length `length(vocabulary)^n`.
`n_gram_stride`	Step size for n-gram encoding. For AACCGGTT with `n_gram = 4` and `n_gram_stride = 2`, generator encodes `(AACC), (CCGG), (GGTT)`; for `n_gram_stride = 4` generator encodes `(AACC), (GGTT)`.
`masked_lm`	If not `NULL`, input and target are equal except that some parts of the input are masked or replaced with random tokens. Must be a list with the following entries: `mask_rate`, the rate of input to mask (replace with the mask token); `random_rate`, the rate of input to set to a random token; `identity_rate`, the rate of input where sample weights are applied but input and output are identical; `include_sw`, whether to include sample weights; and, optionally, `block_len`, so that masked/random/identity regions appear in blocks of size `block_len`.
`char_sequence`	A character string.
`tokenizer`	A keras tokenizer.
`adjust_start_ind`	Whether to shift values in `start_ind` to start at 1: for example (5,11,25) becomes (1,7,21).
`return_int`	Whether to return integer encoding instead of one-hot encoding.
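
The `n_gram` and `n_gram_stride` descriptions above can be illustrated with a small base-R sketch. It is not the deepG implementation: `ngram_windows` and `ngram_index` are hypothetical helpers, and the ordering of n-grams (last character varying fastest) is an assumption chosen to match the `"AA"`, `"AC"`, `"TT"` examples given for `n_gram`.

```r
# Base-R sketch of n-gram windows and their one-hot positions.
# `ngram_windows` and `ngram_index` are hypothetical helpers, not deepG functions.

# Split a string into n-grams, moving `stride` characters between windows.
ngram_windows <- function(s, n, stride = 1) {
  starts <- seq(1, nchar(s) - n + 1, by = stride)
  substring(s, starts, starts + n - 1)
}

# Position of the 1 in a one-hot vector of length length(vocabulary)^n,
# assuming n-grams are ordered with the last character varying fastest.
ngram_index <- function(gram, vocabulary) {
  chars <- strsplit(gram, "")[[1]]
  digits <- match(chars, vocabulary) - 1        # 0-based digit per character
  base <- length(vocabulary)
  sum(digits * base^((length(chars) - 1):0)) + 1
}

vocabulary <- c("A", "C", "G", "T")

ngram_windows("AACCGGTT", n = 4, stride = 2)  # "AACC" "CCGG" "GGTT"
ngram_windows("AACCGGTT", n = 4, stride = 4)  # "AACC" "GGTT"

ngram_index("AA", vocabulary)  # 1
ngram_index("AC", vocabulary)  # 2
ngram_index("TT", vocabulary)  # 16, i.e. length(vocabulary)^2
```

With a stride equal to `n` the windows do not overlap; a smaller stride yields overlapping n-grams and therefore more encoded positions per sample.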

Value

A list of 2 tensors.

Examples


# use integer sequence as input
x <- seq_encoding_label(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                        maxlen = 5,
                        vocabulary = c("a", "c", "g", "t"),
                        start_ind = c(1,3),
                        ambiguous_nuc = "equal")

x[1,,] # 1,0,5,1,3

x[2,,] # 5,1,3,4,3

# use character string as input
x <- seq_encoding_label(maxlen = 5,
                        vocabulary = c("a", "c", "g", "t"),
                        start_ind = c(1,3),
                        ambiguous_nuc = "equal",
                        char_sequence = "ACTaaTNTNaZ")

x[1,,] # actaa

x[2,,] # taatn
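
To make the role of `start_ind` and `maxlen` in the examples above more concrete, the following base-R sketch extracts the same windows without calling deepG. `sample_windows` is a hypothetical helper introduced only for illustration; it shows which slice of the input each sample covers and does not perform any one-hot encoding. The last lines show the arithmetic described for `adjust_start_ind`.

```r
# Base-R sketch: which slice of the input each sample covers.
# `sample_windows` is a hypothetical helper, not part of deepG.
sample_windows <- function(sequence, maxlen, start_ind) {
  # One row per start index, one column per position in the sample.
  t(sapply(start_ind, function(s) sequence[s:(s + maxlen - 1)]))
}

# Integer input from the first example.
int_seq <- c(1, 0, 5, 1, 3, 4, 3, 1, 4, 1, 2)
w <- sample_windows(int_seq, maxlen = 5, start_ind = c(1, 3))
w[1, ]  # 1 0 5 1 3
w[2, ]  # 5 1 3 4 3

# Character input from the second example, lower-cased to match the vocabulary.
chars <- strsplit(tolower("ACTaaTNTNaZ"), "")[[1]]
char_windows <- apply(sample_windows(chars, maxlen = 5, start_ind = c(1, 3)),
                      1, paste, collapse = "")
char_windows  # "actaa" "taatn"

# Shifting start indices so that the first one is 1 (cf. `adjust_start_ind`).
start_ind <- c(5, 11, 25)
start_ind - start_ind[1] + 1  # 1 7 21
```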

GenomeNet/deepG documentation built on Jan. 25, 2025, 12:05 a.m.