| WordPieceTokenizer | R Documentation |
Tokenizer based on the WordPiece model (Wu et al. 2016).
Returns a new object of this class.
aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> WordPieceTokenizer
Inherited methods:

aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
aifeducation::TokenizerBase$calculate_statistics()
aifeducation::TokenizerBase$decode()
aifeducation::TokenizerBase$encode()
aifeducation::TokenizerBase$get_special_tokens()
aifeducation::TokenizerBase$get_tokenizer()
aifeducation::TokenizerBase$get_tokenizer_statistics()
aifeducation::TokenizerBase$load_from_disk()
aifeducation::TokenizerBase$n_special_tokens()
aifeducation::TokenizerBase$save()

configure() Configures a new object of this class.
WordPieceTokenizer$configure(vocab_size = 10000L, vocab_do_lower_case = FALSE)
vocab_size int Size of the vocabulary. Allowed values: 1000 <= x <= 500000
vocab_do_lower_case bool TRUE if all tokens should be lower case.
Returns nothing.
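A minimal sketch of creating and configuring a tokenizer, assuming the aifeducation package is installed and its Python backend is set up; the argument values are illustrative, not defaults.

```r
library(aifeducation)

# Create a new tokenizer object and configure its vocabulary:
# 30,000 entries, with all input lower-cased before tokenization.
tokenizer <- WordPieceTokenizer$new()
tokenizer$configure(
  vocab_size = 30000L,
  vocab_do_lower_case = TRUE
)
```

The vocabulary size must stay within the documented bounds (1000 to 500000), otherwise configure() rejects the value.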
train() Trains a new object of this class.
WordPieceTokenizer$train( text_dataset, statistics_max_tokens_length = 512L, sustain_track = FALSE, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 15L, sustain_log_level = "warning", trace = FALSE )
text_dataset LargeDataSetForText Object storing textual data.
statistics_max_tokens_length int Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192
sustain_track bool If TRUE, energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_code string ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region string Region within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any
sustain_interval int Interval in seconds for measuring power usage. Allowed values: 1 <= x
sustain_log_level string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
trace bool TRUE if information about the estimation phase should be printed to the console.
Returns nothing.
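A sketch of a training call, assuming `tokenizer` is an already configured WordPieceTokenizer and `texts` is a LargeDataSetForText object holding the training corpus; the sustainability settings shown are illustrative.

```r
# Train the tokenizer on the corpus, tracking energy consumption
# with codecarbon. The Alpha-3 ISO code is required whenever
# sustain_track = TRUE.
tokenizer$train(
  text_dataset = texts,
  statistics_max_tokens_length = 512L,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_interval = 15L,
  sustain_log_level = "warning",
  trace = TRUE
)
```

If sustainability tracking is not needed, leave sustain_track = FALSE and the sustain_* arguments can be omitted.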
clone() The objects of this class are cloneable with this method.
WordPieceTokenizer$clone(deep = FALSE)
deep Whether to make a deep clone.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>
Other Tokenizer:
HuggingFaceTokenizer