WordPieceTokenizer: WordPieceTokenizer

WordPieceTokenizer    R Documentation

WordPieceTokenizer

Description

Tokenizer based on the WordPiece model (Wu et al. 2016).

Value

Returns a new object of this class.

Super classes

aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> WordPieceTokenizer

Methods

Public methods

Inherited methods

Method configure()

Configures a new object of this class.

Usage
WordPieceTokenizer$configure(vocab_size = 10000L, vocab_do_lower_case = FALSE)
Arguments
vocab_size

int Size of the vocabulary. Allowed values: 1000 <= x <= 500000

vocab_do_lower_case

bool If TRUE, all tokens are converted to lower case.

Returns

This method returns nothing; it is called for its side effect of configuring the object.
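A minimal usage sketch of configure(). Only the configure() signature above is documented; the $new() constructor call is an assumption based on the usual R6 convention:

```r
# Assumption: the object is created with the standard R6 constructor $new()
tokenizer <- WordPieceTokenizer$new()

# Configure with a 30,000-token vocabulary and lower-casing enabled,
# staying within the documented range 1000 <= vocab_size <= 500000
tokenizer$configure(
  vocab_size = 30000L,
  vocab_do_lower_case = TRUE
)
```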


Method train()

Trains the tokenizer on a text dataset.

Usage
WordPieceTokenizer$train(
  text_dataset,
  statistics_max_tokens_length = 512L,
  sustain_track = FALSE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15L,
  sustain_log_level = "warning",
  trace = FALSE
)
Arguments
text_dataset

LargeDataSetForText Object storing the textual data.

statistics_max_tokens_length

int Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192

sustain_track

bool If TRUE, energy consumption is tracked during training via the Python library 'codecarbon'.

sustain_iso_code

string ISO code (Alpha-3 code) for the country. This variable must be set if sustainability tracking is enabled. A list of codes can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any

sustain_region

string Region within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any

sustain_interval

int Interval in seconds for measuring power usage. Allowed values: 1 <= x

sustain_log_level

string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'

trace

bool If TRUE, information about the estimation phase is printed to the console.

Returns

This method returns nothing; it is called for its side effect of training the tokenizer.
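A sketch of a training call, assuming tokenizer is an already configured WordPieceTokenizer instance and text_data is an existing LargeDataSetForText object (its construction is not documented on this page). Only parameters documented above are used:

```r
# Assumption: 'tokenizer' is a configured WordPieceTokenizer instance and
# 'text_data' is a LargeDataSetForText object holding the training corpus
tokenizer$train(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,  # within the documented range 20..8192
  sustain_track = TRUE,
  sustain_iso_code = "DEU",             # Alpha-3 country code (here: Germany)
  sustain_interval = 15L,               # power measured every 15 seconds
  sustain_log_level = "warning",
  trace = TRUE
)
```

Because sustain_track is TRUE in this sketch, sustain_iso_code must be set; with sustain_track = FALSE the sustainability arguments can be left at their defaults.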


Method clone()

The objects of this class are cloneable with this method.

Usage
WordPieceTokenizer$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>

See Also

Other Tokenizer: HuggingFaceTokenizer


aifeducation documentation built on Nov. 19, 2025, 5:08 p.m.