| TokenizerBase | R Documentation |
Base class for tokenizers containing all methods shared by the sub-classes.
Returns a new object of this class.
Super class: aifeducation::AIFEMaster -> TokenizerBase
Inherited methods (from aifeducation::AIFEMaster):
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()

save()
Method for saving a model on disk.
TokenizerBase$save(dir_path, folder_name)
dir_path Path to the directory where the object should be saved.
folder_name string Name of the folder where the model should be saved. Allowed values: any
Function does not return anything. It is used to save an object on disk.
load_from_disk()
Loads an object from disk and updates it to the current version of the package.
TokenizerBase$load_from_disk(dir_path)
dir_path Path where the object is stored.
Function does not return anything. It loads an object from disk.
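A minimal sketch of the save/load round trip; `tok` and the paths are hypothetical placeholders for an already created tokenizer object:

# Save the tokenizer into the folder "my_tokenizer" inside "models"
tok$save(dir_path = "models", folder_name = "my_tokenizer")

# Later, restore the object's state from that folder
tok$load_from_disk(dir_path = "models/my_tokenizer")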
get_tokenizer_statistics()
Tokenizer statistics.
TokenizerBase$get_tokenizer_statistics()
Returns a data.frame containing the tokenizer's statistics.
get_tokenizer()
Python tokenizer.
TokenizerBase$get_tokenizer()
Returns the Python tokenizer used within the model.
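For illustration, both getters can be combined to inspect a tokenizer; `tok` is again a hypothetical object of a sub-class:

# data.frame with the recorded tokenizer statistics
stats <- tok$get_tokenizer_statistics()

# Underlying Python tokenizer object, e.g. for direct inspection
py_tok <- tok$get_tokenizer()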
encode()
Method for encoding words of raw texts into integers.
TokenizerBase$encode(
  raw_text,
  token_overlap = 0L,
  max_token_sequence_length = 512L,
  n_chunks = 1L,
  token_encodings_only = FALSE,
  token_to_int = TRUE,
  return_token_type_ids = TRUE,
  trace = FALSE
)
raw_text vector Raw text.
token_overlap int Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x
max_token_sequence_length int Maximal number of tokens per chunk. Allowed values: 20 <= x
n_chunks int Maximal number of chunks. Allowed values: 2 <= x
token_encodings_only bool
TRUE: Returns a list containing only the tokens.
FALSE: Returns a list containing the tokens, the number of chunks, and the potential number of chunks for each document/text.
token_to_int bool
TRUE: Returns the tokens as integer indices.
FALSE: Returns the tokens as strings.
return_token_type_ids bool If TRUE, additionally returns the token type IDs.
trace bool TRUE if information about the estimation phase should be printed to the console.
Returns a list containing the integer or token sequences of the raw texts with special tokens.
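A usage sketch for encode(); the texts and parameter choices are illustrative only:

encodings <- tok$encode(
  raw_text = c("First document.", "A second, slightly longer document."),
  token_overlap = 0L,
  max_token_sequence_length = 512L,
  n_chunks = 2L,
  token_encodings_only = TRUE,
  token_to_int = TRUE,
  trace = FALSE
)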
decode()
Method for decoding a sequence of integers into tokens.
TokenizerBase$decode(int_seqence, to_token = FALSE)
int_seqence list List of integer sequences that should be converted to tokens.
to_tokenbool
FALSE: Transforms the integers to plain text.
TRUE: Transforms the integers to a sequence of tokens.
Returns a list of token sequences.
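Assuming `encodings` holds integer sequences produced by encode() with token_to_int = TRUE, decoding could look like this (the argument name int_seqence follows the documented usage):

# Back to plain text
texts <- tok$decode(int_seqence = encodings, to_token = FALSE)

# Or back to the token strings instead
token_seqs <- tok$decode(int_seqence = encodings, to_token = TRUE)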
get_special_tokens()
Method for receiving the special tokens of the model.
TokenizerBase$get_special_tokens()
Returns a matrix containing the special tokens in the rows
and their type, token, and id in the columns.
n_special_tokens()
Method for receiving the number of special tokens of the model.
TokenizerBase$n_special_tokens()
Returns an 'int' counting the number of special tokens.
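A short sketch combining both methods on a hypothetical tokenizer object `tok`:

# Matrix with one row per special token and columns for type, token, and id
special <- tok$get_special_tokens()

# Number of special tokens
n_special <- tok$n_special_tokens()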
calculate_statistics()
Method for calculating tokenizer statistics as suggested by Kaya and Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. <https://doi.org/10.1016/j.iswa.2024.200335>
TokenizerBase$calculate_statistics(
  text_dataset,
  statistics_max_tokens_length,
  step = "creation"
)
text_dataset LargeDataSetForText Object storing the textual data.
statistics_max_tokens_length int Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192
step string String describing the context of the estimation.
Returns a data.frame containing the estimates.
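A sketch, assuming `text_data` is an existing LargeDataSetForText object; the step label and sequence length are illustrative:

stats <- tok$calculate_statistics(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,
  step = "creation"
)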
clone()
The objects of this class are cloneable with this method.
TokenizerBase$clone(deep = FALSE)
deep Whether to make a deep clone.
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular