layer_text_vectorization: Text vectorization layer
In dfalbel/keras: R Interface to 'Keras'

This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one sample = one string) into either a list of token indices (one sample = 1D tensor of integer token indices) or a dense representation (one sample = 1D tensor of float values representing data about the sample's tokens).

layer_text_vectorization(object, max_tokens = NULL,
  standardize = "lower_and_strip_punctuation", split = "whitespace",
  ngrams = NULL, output_mode = c("int", "binary", "count", "tfidf"),
  output_sequence_length = NULL, pad_to_max_tokens = TRUE, ...)

`object`	Model or layer object
`max_tokens`	The maximum size of the vocabulary for this layer. If `NULL`, there is no cap on the size of the vocabulary.
`standardize`	Optional specification for standardization to apply to the input text. Values can be `NULL` (no standardization), `"lower_and_strip_punctuation"` (lowercase and remove punctuation) or a Callable. Default is `"lower_and_strip_punctuation"`.
`split`	Optional specification for splitting the input text. Values can be `NULL` (no splitting), `"whitespace"` (split on ASCII whitespace), or a Callable. Default is `"whitespace"`.
`ngrams`	Optional specification for ngrams to create from the possibly-split input text. Values can be `NULL`, an integer or a list of integers; passing an integer will create ngrams up to that integer, and passing a list of integers will create ngrams for the specified values in the list. Passing `NULL` means that no ngrams will be created.
`output_mode`	Optional specification for the output of the layer. Values can be `"int"`, `"binary"`, `"count"` or `"tfidf"`, which control the output as follows. `"int"`: outputs integer indices, one integer index per split string token. `"binary"`: outputs a single integer array per sample, of size vocab_size or `max_tokens`, containing 1s at every index whose token occurs at least once in that sample. `"count"`: as `"binary"`, but the array holds the number of times the token at each index appears in the sample. `"tfidf"`: as `"binary"`, but the value in each token slot is computed with the TF-IDF algorithm.
`output_sequence_length`	Only valid in "int" mode. If set, the output will have its time dimension padded or truncated to exactly `output_sequence_length` values, resulting in a tensor of shape (batch_size, output_sequence_length) regardless of how many tokens resulted from the splitting step. Defaults to `NULL`.
`pad_to_max_tokens`	Only valid in `"binary"`, `"count"`, and `"tfidf"` modes. If `TRUE`, the output will have its feature axis padded to `max_tokens` even if the number of unique tokens in the vocabulary is less than `max_tokens`, resulting in a tensor of shape (batch_size, max_tokens) regardless of vocabulary size. Defaults to `TRUE`.
`...`	Not used.
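
To make the `output_mode`, `output_sequence_length` and `pad_to_max_tokens` arguments more concrete, here is a minimal base-R sketch of what the `"int"`, `"count"` and `"binary"` outputs look like for a single sample. It does not call the layer and is not its implementation. The vocabulary, the helper `pad_or_truncate`, and the choice to pad on the right with 0 are assumptions made for illustration; the real layer learns its vocabulary from data and may reserve some indices (for example for padding or unknown tokens). The `"tfidf"` mode is left out because its exact weighting is not described here.

```r
# Hypothetical vocabulary and one already-tokenized sample.
vocab  <- c("the", "cat", "sat", "mat")
tokens <- c("the", "cat", "sat", "the")

# "int": one index per token, in order of appearance.
int_out <- match(tokens, vocab)

# output_sequence_length: truncate, or pad on the right with 0 (assumed),
# so that every sample has exactly `len` values.
pad_or_truncate <- function(idx, len) {
  idx <- idx[seq_len(min(length(idx), len))]
  c(idx, rep(0L, len - length(idx)))
}

# "count": number of occurrences of each vocabulary entry in the sample.
count_out <- tabulate(int_out, nbins = length(vocab))

# "binary": 1 where the vocabulary entry occurs at least once, else 0.
binary_out <- as.integer(count_out > 0)

# pad_to_max_tokens: extend the feature axis with zeros up to max_tokens.
max_tokens   <- 6L
count_padded <- c(count_out, rep(0L, max_tokens - length(vocab)))
```

In this sketch the `"int"` output has one entry per token, so its length varies by sample unless it is padded or truncated, while the `"count"` and `"binary"` outputs always have one slot per vocabulary entry.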

The processing of each sample contains the following steps:

1. Standardize each sample (usually lowercasing + punctuation stripping).
2. Split each sample into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique int value with each token).
5. Transform each sample using this index, either into a vector of ints or a dense float vector.
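
The steps above can be sketched in base R. This is a rough illustration under stated assumptions, not the layer's implementation: the helper names (`standardize`, `split_ws`, `make_ngrams`, `build_vocab`, `to_indices`) are hypothetical, and joining ngrams with a space, ordering the vocabulary by frequency, and using 0 for tokens missing from the vocabulary are all simplifying assumptions.

```r
# 1. Standardize: lowercase and strip punctuation.
standardize <- function(x) gsub("[[:punct:]]", "", tolower(x))

# 2. Split each sample on whitespace; returns a list of character vectors.
split_ws <- function(x) strsplit(trimws(x), "[[:space:]]+")

# 3. Recombine tokens into ngrams of the given widths, joined by a space.
make_ngrams <- function(tokens, widths) {
  unlist(lapply(widths, function(n) {
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# 4. Index: most frequent token first, ties broken alphabetically (assumed).
build_vocab <- function(token_lists) {
  counts <- table(unlist(token_lists))
  names(counts)[order(-as.integer(counts), names(counts))]
}

# 5. Transform: replace each token by its position in the vocabulary.
# Assumption: 0 marks a token that is not in the vocabulary.
to_indices <- function(tokens, vocab) {
  idx <- match(tokens, vocab)
  idx[is.na(idx)] <- 0L
  idx
}

samples <- c("The cat sat.", "The cat ran!")
tokens  <- lapply(split_ws(standardize(samples)), make_ngrams, widths = 1:2)
vocab   <- build_vocab(tokens)
encoded <- lapply(tokens, to_indices, vocab = vocab)
```

Here `widths = 1:2` mirrors passing an integer `ngrams = 2` (ngrams up to that size); passing a single width such as `widths = 2` would correspond to a list containing only that value.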