tokenize: Tokenizers for various objects.

View source: R/tokenization.R

tokenize {RBERT}    R Documentation

Tokenizers for various objects.

Description

These tokenizers perform some basic cleaning, then split text on whitespace and punctuation.

Usage

tokenize(tokenizer, text)

## S3 method for class 'FullTokenizer'
tokenize(tokenizer, text)

## S3 method for class 'BasicTokenizer'
tokenize(tokenizer, text)

## S3 method for class 'WordpieceTokenizer'
tokenize(tokenizer, text)

Arguments

tokenizer

The Tokenizer object to use.

text

The text to tokenize. For tokenize.WordpieceTokenizer, the text should already have been passed through a BasicTokenizer (see the final example below).

Value

A list of tokens.

Methods (by class)

  • FullTokenizer: tokenize method for objects of class FullTokenizer.

  • BasicTokenizer: tokenize method for objects of class BasicTokenizer.

  • WordpieceTokenizer: tokenize method for objects of class WordpieceTokenizer. This method uses a greedy longest-match-first algorithm to tokenize text using the given vocabulary. For example, the input "unaffable" would nominally give the output list("un", "##aff", "##able"). (Ironically, the actual BERT vocabulary yields list("una", "##ffa", "##ble") for this input, even though the BERT code uses it as an example.)
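
To make the greedy longest-match-first behavior concrete, here is a small standalone sketch of the matching loop. It is illustrative only, not the package's implementation; wordpiece_split_sketch and the toy vocabulary are hypothetical names used just for this example.

wordpiece_split_sketch <- function(word, vocab, max_chars = 100) {
  # Overly long words map to the unknown token.
  if (nchar(word) > max_chars) {
    return("[UNK]")
  }
  pieces <- character(0)
  start <- 1
  n <- nchar(word)
  while (start <= n) {
    end <- n
    cur_piece <- NA_character_
    # Try the longest remaining substring first, shrinking from the right.
    while (start <= end) {
      piece <- substr(word, start, end)
      if (start > 1) {
        piece <- paste0("##", piece)  # continuation pieces get the ## prefix
      }
      if (piece %in% vocab) {
        cur_piece <- piece
        break
      }
      end <- end - 1
    }
    if (is.na(cur_piece)) {
      # No substring matched; the whole word becomes the unknown token.
      return("[UNK]")
    }
    pieces <- c(pieces, cur_piece)
    start <- end + 1
  }
  pieces
}

# wordpiece_split_sketch("unaffable", c("un", "##aff", "##able"))
# returns c("un", "##aff", "##able")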

Examples

## Not run: 
tokenizer <- FullTokenizer("vocab.txt", TRUE)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)
## Not run: 
tokenizer <- BasicTokenizer(TRUE)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)
## Not run: 
vocab <- load_vocab(vocab_file = "vocab.txt")
tokenizer <- WordpieceTokenizer(vocab)
tokenize(tokenizer, text = "a bunch of words")

## End(Not run)
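
## Not run: 
# As noted for the 'text' argument, the WordpieceTokenizer method expects
# text that has already been through basic tokenization. A sketch of that
# flow (assumes a local "vocab.txt"; the exact shape of the returned token
# list may differ):
vocab <- load_vocab(vocab_file = "vocab.txt")
basic_tokenizer <- BasicTokenizer(TRUE)
wordpiece_tokenizer <- WordpieceTokenizer(vocab)
basic_tokens <- tokenize(basic_tokenizer, text = "An unaffable example!")
wordpiece_tokens <- tokenize(
  wordpiece_tokenizer,
  text = paste(unlist(basic_tokens), collapse = " ")
)

## End(Not run)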
