seq_encoding_lm | R Documentation |
Helper function for generator_fasta_lm.
Encodes an integer sequence as an input/target list, according to the output_format argument.
seq_encoding_lm(
  sequence = NULL,
  maxlen,
  vocabulary,
  start_ind,
  ambiguous_nuc = "zero",
  nuc_dist = NULL,
  quality_vector = NULL,
  return_int = FALSE,
  target_len = 1,
  use_coverage = FALSE,
  max_cov = NULL,
  cov_vector = NULL,
  n_gram = NULL,
  n_gram_stride = 1,
  output_format = "target_right",
  char_sequence = NULL,
  adjust_start_ind = FALSE,
  tokenizer = NULL
)
sequence |
Sequence of integers. |
maxlen |
Length of predictor sequence. |
vocabulary |
Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc. |
start_ind |
Start positions of samples in sequence. |
ambiguous_nuc |
How to handle nucleotides outside vocabulary, either |
nuc_dist |
Nucleotide distribution. |
quality_vector |
Vector of quality probabilities. |
return_int |
Whether to return integer encoding or one-hot encoding. |
target_len |
Number of nucleotides to predict at once for language model. |
use_coverage |
Integer or |
max_cov |
Biggest coverage value. Only applies if |
cov_vector |
Vector of coverage values associated to the input. |
n_gram |
Integer, encode target not nucleotide wise but combine n nucleotides at once. For example for |
n_gram_stride |
Step size for n-gram encoding. For AACCGGTT with |
output_format |
Determines shape of output tensor for language model.
Either
|
char_sequence |
A character string. |
adjust_start_ind |
Whether to shift values in start_ind. |
tokenizer |
A keras tokenizer. |
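The interplay of n_gram and n_gram_stride can be pictured as a sliding window over the nucleotides. The sketch below is plain base R and is not taken from the package; the helper name ngram_windows and the values n = 4 and stride = 2 or 4 are assumptions chosen for illustration only.

```r
# Hypothetical helper (not part of the package): cut a string into
# windows of n characters, moving `stride` characters between windows.
ngram_windows <- function(s, n, stride = 1) {
  stopifnot(nchar(s) >= n, stride >= 1)
  starts <- seq(1, nchar(s) - n + 1, by = stride)
  substring(s, starts, starts + n - 1)
}

w2 <- ngram_windows("AACCGGTT", n = 4, stride = 2)  # "AACC" "CCGG" "GGTT"
w4 <- ngram_windows("AACCGGTT", n = 4, stride = 4)  # "AACC" "GGTT"
print(w2)
print(w4)
```

A stride equal to n gives non-overlapping n-grams, while a smaller stride makes consecutive n-grams share nucleotides.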
A list of 2 tensors.
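To make the roles of start_ind, maxlen and target_len concrete, the following base R sketch cuts input and target windows out of an integer sequence for the "target_right" layout. It only mimics the windowing on integers and is an assumption about the behaviour, not the package's implementation; the helper name window_samples is hypothetical, and no one-hot encoding is applied.

```r
# Hypothetical helper (not part of the package): for each start position,
# take maxlen values as input and the next target_len values as target.
window_samples <- function(sequence, start_ind, maxlen, target_len = 1) {
  n <- length(start_ind)
  x <- matrix(NA_real_, nrow = n, ncol = maxlen)
  y <- matrix(NA_real_, nrow = n, ncol = target_len)
  for (i in seq_len(n)) {
    s <- start_ind[i]
    stopifnot(s + maxlen + target_len - 1 <= length(sequence))
    x[i, ] <- sequence[s:(s + maxlen - 1)]
    y[i, ] <- sequence[(s + maxlen):(s + maxlen + target_len - 1)]
  }
  list(x = x, y = y)
}

res <- window_samples(sequence = c(1, 0, 5, 1, 3, 4, 3, 1, 4, 1, 2),
                      start_ind = c(1, 3), maxlen = 5, target_len = 1)
res$x[1, ]  # 1 0 5 1 3
res$y[1, ]  # 4
res$x[2, ]  # 5 1 3 4 3
res$y[2, ]  # 1
```

Each row of x is one sample of length maxlen, and the matching row of y holds the values that immediately follow it.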
# use integer sequence as input
z <- seq_encoding_lm(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "equal",
                     target_len = 1,
                     output_format = "target_right")
x <- z[[1]]
y <- z[[2]]
x[1,,] # 1,0,5,1,3
y[1,] # 4
x[2,,] # 5,1,3,4,3
y[2,] # 1
# use character string as input
z <- seq_encoding_lm(sequence = NULL,
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "zero",
                     target_len = 1,
                     output_format = "target_right",
                     char_sequence = "ACTaaTNTNaZ")
x <- z[[1]]
y <- z[[2]]
x[1,,] # actaa
y[1,] # t
x[2,,] # taatn
y[2,] # t
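As a rough picture of what happens to a character string before windowing, the base R sketch below lower-cases the input, maps each character to its position in vocabulary, and one-hot encodes the result. It is not the package's code. The helper name one_hot is hypothetical, and the treatment of out-of-vocabulary characters (an all-zero row for "zero", a row of 1/length(vocabulary) for "equal") is an assumption inferred from the option names.

```r
vocabulary <- c("a", "c", "g", "t")
chars <- strsplit(tolower("ACTaaTNTNaZ"), "")[[1]]
# match() gives NA for characters outside the vocabulary (here "n" and "z")
idx <- match(chars, vocabulary)

# Hypothetical helper (not part of the package).
# ASSUMPTION: "zero" -> all-zero row, "equal" -> every entry is 1/k.
one_hot <- function(idx, k, ambiguous = c("zero", "equal")) {
  ambiguous <- match.arg(ambiguous)
  fill <- if (ambiguous == "zero") 0 else 1 / k
  out <- matrix(0, nrow = length(idx), ncol = k)
  for (i in seq_along(idx)) {
    if (is.na(idx[i])) out[i, ] <- fill else out[i, idx[i]] <- 1
  }
  out
}

enc_zero  <- one_hot(idx, length(vocabulary), "zero")
enc_equal <- one_hot(idx, length(vocabulary), "equal")
enc_zero[7, ]   # row for "n": 0 0 0 0
enc_equal[7, ]  # row for "n": 0.25 0.25 0.25 0.25
```

Under this assumption, rows for in-vocabulary characters are identical in both encodings; only the rows for ambiguous characters differ.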