tokenize | R Documentation

Description

This tokenizer performs some basic cleaning, then splits up text on whitespace and punctuation.
Usage

tokenize(tokenizer, text)

## S3 method for class 'FullTokenizer'
tokenize(tokenizer, text)

## S3 method for class 'BasicTokenizer'
tokenize(tokenizer, text)

## S3 method for class 'WordpieceTokenizer'
tokenize(tokenizer, text)
Arguments

tokenizer
The Tokenizer object to use.

text
The text to tokenize. For tokenize.WordpieceTokenizer, the text should already have been passed through a BasicTokenizer.
Value

A list of tokens.
Methods (by class)

FullTokenizer
: tokenize method for objects of class FullTokenizer.

BasicTokenizer
: tokenize method for objects of class BasicTokenizer.

WordpieceTokenizer
: tokenize method for objects of class WordpieceTokenizer. This method
uses a greedy longest-match-first algorithm to perform tokenization
with the given vocabulary. For example: input = "unaffable" gives
output = list("un", "##aff", "##able"). Ironically, the actual BERT
vocabulary gives output = list("una", "##ffa", "##ble") for this word,
even though the BERT code uses it as an example. A standalone sketch of
the algorithm follows this list.
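To make the greedy longest-match-first behavior concrete, here is a
standalone sketch. It is illustrative only, not the package's internal
implementation, and it assumes the vocabulary is a plain character
vector of wordpiece entries.

wordpiece_sketch <- function(word, vocab) {
  pieces <- character(0)
  start <- 1
  while (start <= nchar(word)) {
    end <- nchar(word)
    found <- NULL
    # Try the longest remaining substring first, shrinking from the
    # right until a vocabulary entry matches.
    while (end >= start) {
      piece <- substr(word, start, end)
      if (start > 1) piece <- paste0("##", piece)  # continuation marker
      if (piece %in% vocab) {
        found <- piece
        break
      }
      end <- end - 1
    }
    if (is.null(found)) return(list("[UNK]"))  # whole word is unknown
    pieces <- c(pieces, found)
    start <- end + 1
  }
  as.list(pieces)
}

vocab <- c("un", "##aff", "##able", "una", "##ffa", "##ble")
wordpiece_sketch("unaffable", vocab)
# Greedy matching finds "una" before it ever tries "un", so the result
# is list("una", "##ffa", "##ble"), as noted above.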
Examples

## Not run: 
tokenizer <- FullTokenizer("vocab.txt", TRUE)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)

## Not run: 
tokenizer <- BasicTokenizer(TRUE)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)

## Not run: 
vocab <- load_vocab(vocab_file = "vocab.txt")
tokenizer <- WordpieceTokenizer(vocab)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)
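Because tokenize.WordpieceTokenizer expects text that has already been
passed through a BasicTokenizer, the two tokenizers can also be chained
by hand. The following sketch assumes the same constructors and
"vocab.txt" file as above, and that the wordpiece method accepts one
basic token at a time; FullTokenizer presumably wraps a similar
pipeline.

## Not run: 
basic_tokens <- tokenize(BasicTokenizer(TRUE), text = "a bunch of words")
vocab <- load_vocab(vocab_file = "vocab.txt")
wp_tokenizer <- WordpieceTokenizer(vocab)
# Wordpiece-tokenize each basic token, then flatten into one list.
unlist(lapply(basic_tokens, function(tok) tokenize(wp_tokenizer, tok)))

## End(Not run)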