tokenize: Tokenizers for various objects.

View source: R/tokenization.R

tokenize {RBERT}    R Documentation

Tokenizers for various objects.

Description

These tokenizers perform some basic cleaning, then split text on whitespace and punctuation.

Usage

tokenize(tokenizer, text)

## S3 method for class 'FullTokenizer'
tokenize(tokenizer, text)

## S3 method for class 'BasicTokenizer'
tokenize(tokenizer, text)

## S3 method for class 'WordpieceTokenizer'
tokenize(tokenizer, text)

Arguments

tokenizer

The Tokenizer object to use.

text

The text to tokenize. For tokenize.WordpieceTokenizer, the text should already have been passed through a BasicTokenizer (see the final example below).

Value

A list of tokens.

Methods (by class)

  • FullTokenizer: tokenize method for objects of class FullTokenizer.

  • BasicTokenizer: tokenize method for objects of class BasicTokenizer.

  • WordpieceTokenizer: tokenize method for objects of class WordpieceTokenizer. This method uses a greedy longest-match-first algorithm to tokenize text using the given vocabulary. For example, the input "unaffable" would nominally give the output list("un", "##aff", "##able"). (Ironically, the actual BERT vocabulary yields list("una", "##ffa", "##ble") for this input, even though the BERT code uses it as an example.)
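
To make the greedy longest-match-first behavior concrete, here is a small standalone sketch of the matching loop. It is illustrative only, not the package's implementation; wordpiece_split_sketch and the toy vocabulary are hypothetical names used just for this example.

wordpiece_split_sketch <- function(word, vocab, max_chars = 100) {
  # Overly long words map to the unknown token.
  if (nchar(word) > max_chars) {
    return("[UNK]")
  }
  pieces <- character(0)
  start <- 1
  n <- nchar(word)
  while (start <= n) {
    end <- n
    cur_piece <- NA_character_
    # Try the longest remaining substring first, shrinking from the right.
    while (start <= end) {
      piece <- substr(word, start, end)
      if (start > 1) {
        piece <- paste0("##", piece)  # continuation pieces get the ## prefix
      }
      if (piece %in% vocab) {
        cur_piece <- piece
        break
      }
      end <- end - 1
    }
    if (is.na(cur_piece)) {
      # No substring matched; the whole word becomes the unknown token.
      return("[UNK]")
    }
    pieces <- c(pieces, cur_piece)
    start <- end + 1
  }
  pieces
}

# wordpiece_split_sketch("unaffable", c("un", "##aff", "##able"))
# returns c("un", "##aff", "##able")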

Examples

## Not run: 
tokenizer <- FullTokenizer("vocab.txt", TRUE)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)
## Not run: 
tokenizer <- BasicTokenizer(TRUE)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)
## Not run: 
vocab <- load_vocab(vocab_file = "vocab.txt")
tokenizer <- WordpieceTokenizer(vocab)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)
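
## Not run: 
# As noted for the 'text' argument, the WordpieceTokenizer method expects
# text that has already been through basic tokenization. A sketch of that
# flow (assumes a local "vocab.txt"; the exact shape of the returned token
# list may differ):
vocab <- load_vocab(vocab_file = "vocab.txt")
basic_tokenizer <- BasicTokenizer(TRUE)
wordpiece_tokenizer <- WordpieceTokenizer(vocab)
basic_tokens <- tokenize(basic_tokenizer, text = "An unaffable example!")
wordpiece_tokens <- tokenize(
  wordpiece_tokenizer,
  text = paste(unlist(basic_tokens), collapse = " ")
)

## End(Not run)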
