WordPieceTokenizer: WordPieceTokenizer

WordPieceTokenizer    R Documentation

WordPieceTokenizer

Description

Tokenizer based on the WordPiece model (Wu et al. 2016).

Value

Returns a new object of this class.

Super classes

aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> WordPieceTokenizer

Methods

Public methods

Inherited methods

Method configure()

Configures a new object of this class.

Usage
WordPieceTokenizer$configure(vocab_size = 10000L, vocab_do_lower_case = FALSE)
Arguments
vocab_size

int Size of the vocabulary. Allowed values: 1000 <= x <= 500000

vocab_do_lower_case

bool If TRUE, all tokens are converted to lower case.

Returns

This method returns nothing; it is called for its side effect of configuring the object.
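A minimal usage sketch of configure(). Only the configure() signature above is documented; the $new() constructor call is an assumption based on the usual R6 convention:

```r
# Assumption: the object is created with the standard R6 constructor $new()
tokenizer <- WordPieceTokenizer$new()

# Configure with a 30,000-token vocabulary and lower-casing enabled,
# staying within the documented range 1000 <= vocab_size <= 500000
tokenizer$configure(
  vocab_size = 30000L,
  vocab_do_lower_case = TRUE
)
```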


Method train()

Trains the tokenizer on a text dataset.

Usage
WordPieceTokenizer$train(
  text_dataset,
  statistics_max_tokens_length = 512L,
  sustain_track = FALSE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15L,
  sustain_log_level = "warning",
  trace = FALSE
)
Arguments
text_dataset

LargeDataSetForText Object storing the textual data.

statistics_max_tokens_length

int Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192

sustain_track

bool If TRUE, energy consumption is tracked during training via the Python library 'codecarbon'.

sustain_iso_code

string ISO code (Alpha-3 code) for the country. This variable must be set if sustainability tracking is enabled. A list of codes can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any

sustain_region

string Region within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any

sustain_interval

int Interval in seconds for measuring power usage. Allowed values: 1 <= x

sustain_log_level

string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'

trace

bool If TRUE, information about the estimation phase is printed to the console.

Returns

This method returns nothing; it is called for its side effect of training the tokenizer.
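A sketch of a training call, assuming tokenizer is an already configured WordPieceTokenizer instance and text_data is an existing LargeDataSetForText object (its construction is not documented on this page). Only parameters documented above are used:

```r
# Assumption: 'tokenizer' is a configured WordPieceTokenizer instance and
# 'text_data' is a LargeDataSetForText object holding the training corpus
tokenizer$train(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,  # within the documented range 20..8192
  sustain_track = TRUE,
  sustain_iso_code = "DEU",             # Alpha-3 country code (here: Germany)
  sustain_interval = 15L,               # power measured every 15 seconds
  sustain_log_level = "warning",
  trace = TRUE
)
```

Because sustain_track is TRUE in this sketch, sustain_iso_code must be set; with sustain_track = FALSE the sustainability arguments can be left at their defaults.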


Method clone()

The objects of this class are cloneable with this method.

Usage
WordPieceTokenizer$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

References

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>

See Also

Other Tokenizer: HuggingFaceTokenizer


aifeducation documentation built on Nov. 19, 2025, 5:08 p.m.