View source: R/layer-text_vectorization.R
This layer has basic options for managing text in a Keras model. It
transforms a batch of strings (one sample = one string) into either a list of
token indices (one sample = 1D tensor of integer token indices) or a dense
representation (one sample = 1D tensor of float values representing data about
the sample's tokens).
layer_text_vectorization(object, max_tokens = NULL,
  standardize = "lower_and_strip_punctuation", split = "whitespace",
  ngrams = NULL, output_mode = c("int", "binary", "count", "tfidf"),
  output_sequence_length = NULL, pad_to_max_tokens = TRUE, ...)
|
object |
Model or layer object
|
max_tokens |
The maximum size of the vocabulary for this layer. If NULL,
there is no cap on the size of the vocabulary.
|
standardize |
Optional specification for standardization to apply to the
input text. Values can be NULL (no standardization),
"lower_and_strip_punctuation" (lowercase and remove punctuation), or a
callable. Default is "lower_and_strip_punctuation".
|
split |
Optional specification for splitting the input text. Values can be
NULL (no splitting), "whitespace" (split on ASCII whitespace), or
a callable. Default is "whitespace", as shown in the usage above.
|
ngrams |
Optional specification for ngrams to create from the possibly-split
input text. Values can be NULL, an integer, or a list of integers. Passing
an integer creates ngrams of every size up to that integer; passing a list
of integers creates ngrams only for the sizes given in the list. Passing
NULL means that no ngrams will be created.
|
output_mode |
Optional specification for the output of the layer. Values can
be "int", "binary", "count", or "tfidf", which control the output as follows:
  "int": Outputs integer indices, one integer index per split string token.
  "binary": Outputs a single int array per batch item, of either vocab_size
    or max_tokens size, containing 1s in all elements where the token mapped
    to that index exists at least once in the batch item.
  "count": As "binary", but the int array contains a count of the number of
    times the token at that index appeared in the batch item.
  "tfidf": As "binary", but the TF-IDF algorithm is applied to find the value
    in each token slot.
|
output_sequence_length |
Only valid in "int" mode. If set, the output will have
its time dimension padded or truncated to exactly output_sequence_length
values, resulting in a tensor of shape (batch_size, output_sequence_length) regardless
of how many tokens resulted from the splitting step. Defaults to NULL.
|
pad_to_max_tokens |
Only valid in "binary", "count", and "tfidf" modes. If TRUE,
the output will have its feature axis padded to max_tokens even if the
number of unique tokens in the vocabulary is less than max_tokens,
resulting in a tensor of shape (batch_size, max_tokens) regardless of
vocabulary size. Defaults to TRUE.
|
... |
Not used.
|
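The output-related arguments can be illustrated with a small base-R sketch. It does not call the layer itself: the vocabulary, the indexed sample, and the helper pad_or_truncate() are hypothetical, and using 0 as the padding value is an assumption made for illustration. The "tfidf" mode is left out because the exact weighting the layer applies is not described here.

```r
# Illustrative sketch only; not the layer's implementation.
# Hypothetical vocabulary and one sample already mapped to 1-based indices.
vocab <- c("the", "cat", "sat", "dog")
sample_ids <- c(1L, 2L, 3L, 1L)  # stands for "the cat sat the"

# "count" mode: one slot per vocabulary entry, holding occurrence counts.
count_vec <- tabulate(sample_ids, nbins = length(vocab))

# "binary" mode: 1 where the token occurs at least once, else 0.
binary_vec <- as.integer(count_vec > 0)

# pad_to_max_tokens = TRUE: extend the feature axis to max_tokens slots.
max_tokens <- 6L
padded_count <- c(count_vec, integer(max_tokens - length(vocab)))

# "int" mode with output_sequence_length: pad or truncate to a fixed length.
# Padding with 0 is an assumption of this sketch.
pad_or_truncate <- function(ids, len) {
  out <- integer(len)
  n <- min(length(ids), len)
  out[seq_len(n)] <- ids[seq_len(n)]
  out
}

padded_ids <- pad_or_truncate(sample_ids, 6L)
truncated_ids <- pad_or_truncate(sample_ids, 3L)
```

In this sketch every sample ends up with the same length in each mode, which is what lets a batch be stacked into a tensor of shape (batch_size, max_tokens) or (batch_size, output_sequence_length).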
The processing of each sample contains the following steps:
1. standardize each sample (usually lowercasing + punctuation stripping)
2. split each sample into substrings (usually words)
3. recombine substrings into tokens (usually ngrams)
4. index tokens (associate a unique int value with each token)
5. transform each sample using this index, either into a vector of ints or
   a dense float vector.
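
The steps above can be sketched in base R as follows. This is a rough illustration under stated assumptions, not the layer's actual implementation: the sample strings and the helper make_ngrams() are hypothetical, and the indices are simply assigned in order of first appearance, so the real layer's index assignment (for example, any reserved indices or ordering by frequency) may differ.

```r
# Illustrative sketch only; not the layer's implementation.
samples <- c("The cat sat.", "The dog sat, too!")

# 1. Standardize: lowercase and strip punctuation.
standardized <- gsub("[[:punct:]]", "", tolower(samples))

# 2. Split each sample on whitespace.
words <- strsplit(standardized, "[[:space:]]+")

# 3. Recombine substrings into tokens. Here: unigrams plus bigrams,
#    which corresponds to the "up to that integer" reading of ngrams = 2.
make_ngrams <- function(w, n) {  # hypothetical helper
  if (length(w) < n) return(character(0))
  starts <- seq_len(length(w) - n + 1)
  vapply(starts,
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}
tokens <- lapply(words, function(w) c(w, make_ngrams(w, 2)))

# 4. Index tokens: one unique integer per distinct token
#    (order of first appearance; an assumption of this sketch).
vocab <- unique(unlist(tokens))

# 5. Transform each sample into a vector of ints using this index.
encoded <- lapply(tokens, match, table = vocab)
```

With these two samples, the shared tokens "the" and "sat" map to the same integer in both encoded vectors, which is the property the indexing step is meant to provide.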
dfalbel/keras documentation built on Nov. 27, 2019, 8:16 p.m.