| TextEmbeddingModel | R Documentation |
This R6 class stores a text embedding model which can be used to tokenize, encode, decode, and embed
raw texts. The object provides a uniform interface for different text processing methods.
Objects of class TextEmbeddingModel transform raw texts into numerical representations which can be used for downstream tasks. To this end, objects of this class can tokenize raw texts, encode tokens as sequences of integers, and decode sequences of integers back into tokens.
last_training ('list()')
List for storing the history and the results of the last training. This
information will be overwritten if a new training is started.
tokenizer_statistics ('matrix()')
Matrix containing the tokenizer statistics for the creation of the tokenizer
and all training runs according to Kaya & Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335
configure(): Method for creating a new text embedding model.
TextEmbeddingModel$configure( model_name = NULL, model_label = NULL, model_language = NULL, method = NULL, ml_framework = "pytorch", max_length = 0, chunks = 2, overlap = 0, emb_layer_min = "middle", emb_layer_max = "2_3_layer", emb_pool_type = "average", model_dir = NULL, trace = FALSE )
model_name: string containing the name of the new model.
model_label: string containing the label/title of the new model.
model_language: string containing the language which the model
represents (e.g., English).
method: string determining the kind of embedding model. Currently
the following models are supported:
method="bert" for Bidirectional Encoder Representations from Transformers (BERT),
method="roberta" for A Robustly Optimized BERT Pretraining Approach (RoBERTa),
method="longformer" for Long-Document Transformer,
method="funnel" for Funnel-Transformer,
method="deberta_v2" for Decoding-enhanced BERT with Disentangled Attention (DeBERTa V2),
method="glove" for GlobalVector Clusters, and method="lda" for topic modeling. See
details for more information.
ml_framework: string determining the framework to use for the model:
ml_framework="tensorflow" for 'tensorflow' and ml_framework="pytorch"
for 'pytorch'. Only relevant for transformer models. To request a bag-of-words model,
set ml_framework=NULL.
max_length: int determining the maximum length of token
sequences used in transformer models. Not relevant for the other methods.
chunks: int determining the maximum number of chunks. Must be at least 2.
overlap: int determining the number of tokens which should be added
at the beginning of the next chunk. Only relevant for transformer models.
emb_layer_min: int or string determining the first layer to be included
in the creation of embeddings. An integer corresponds to the layer number. The first
layer has the number 1. Instead of an integer, the following strings are possible:
"start" for the first layer, "middle" for the middle layer,
"2_3_layer" for the layer at two-thirds of the depth, and "last" for the last layer.
emb_layer_max: int or string determining the last layer to be included
in the creation of embeddings. An integer corresponds to the layer number. The first
layer has the number 1. Instead of an integer, the following strings are possible:
"start" for the first layer, "middle" for the middle layer,
"2_3_layer" for the layer at two-thirds of the depth, and "last" for the last layer.
emb_pool_type: string determining the method for pooling the token embeddings
within each layer. If "cls", only the embedding of the CLS token is used. If
"average", the embeddings of all tokens are averaged (excluding padding tokens).
"cls" is not supported for method="funnel".
model_dir: string Path to the directory where the
pretrained transformer model is stored.
trace: bool TRUE prints information about the progress.
FALSE does not.
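How max_length, chunks, and overlap interact can be illustrated with a small base R sketch. The helper chunk_tokens below is hypothetical and not part of the package. It assumes that every chunk after the first starts with the last overlap tokens of the previous chunk and that tokens beyond the last chunk are dropped; the package's actual implementation may differ.

```r
# Illustrative only: split a sequence of token ids into overlapping chunks.
# `chunk_tokens` is a hypothetical helper, not a function of the package.
chunk_tokens <- function(ids, max_length, chunks = 2, overlap = 0) {
  stopifnot(max_length > overlap, chunks >= 2)
  # Each later chunk contributes `max_length - overlap` new tokens.
  step <- max_length - overlap
  starts <- seq(1, by = step, length.out = chunks)
  # Drop chunks that would start beyond the end of the sequence.
  starts <- starts[starts <= length(ids)]
  lapply(starts, function(s) ids[s:min(s + max_length - 1, length(ids))])
}

# Ten token ids, chunks of at most four tokens, one token of overlap.
# Resulting chunks: 1..4, 4..7, 7..10
chunk_tokens(1:10, max_length = 4, chunks = 3, overlap = 1)
```

With overlap = 0 the chunks are disjoint. A larger overlap repeats more context at the start of each chunk, at the cost of covering fewer new tokens per chunk.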
In the case of any transformer (e.g., method="bert",
method="roberta", and method="longformer"),
a pretrained transformer model must be supplied via model_dir.
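The layer selection (emb_layer_min, emb_layer_max) and the pooling type (emb_pool_type) described in the arguments can be sketched in base R. Both helpers are hypothetical and only illustrate the idea. The rounding used for "middle" and "2_3_layer" and the position of the CLS token in the first row are assumptions, not confirmed behavior of the package.

```r
# Hypothetical helper: translate a layer specification into a layer number.
# The rounding for "middle" and "2_3_layer" is an assumption of this sketch.
resolve_layer <- function(spec, n_layers) {
  if (is.numeric(spec)) return(as.integer(spec))
  switch(spec,
    start = 1L,
    middle = as.integer(ceiling(n_layers / 2)),
    "2_3_layer" = as.integer(ceiling(2 * n_layers / 3)),
    last = as.integer(n_layers),
    stop("Unknown layer specification: ", spec)
  )
}

# Hypothetical helper: pool the token embeddings of one layer.
# Rows of `token_emb` are tokens, columns are embedding dimensions.
# `is_padding` flags padding rows; the CLS token is assumed to be row 1.
pool_tokens <- function(token_emb, is_padding, type = "average") {
  if (type == "cls") return(token_emb[1, ])
  colMeans(token_emb[!is_padding, , drop = FALSE])
}

# Layers selected for a model with 12 layers under the default settings.
layers <- resolve_layer("middle", 12):resolve_layer("2_3_layer", 12)

# Three tokens with two dimensions each; the last token is padding.
emb <- matrix(c(1, 3, 100, 2, 4, 100), nrow = 3)
pool_tokens(emb, is_padding = c(FALSE, FALSE, TRUE), type = "average")
```

Excluding padding rows before averaging matters because padded positions would otherwise pull the pooled embedding toward values that carry no information about the text.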
Returns an object of class TextEmbeddingModel.
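A call to configure might look like the following sketch. The helper build_bert_embedder, the model name and label, and the values for max_length and overlap are illustrative assumptions. The sketch also assumes that the R6 generator can be instantiated with $new() without arguments.

```r
# Sketch: create and configure a BERT-based text embedding model.
# `build_bert_embedder` is a hypothetical helper; `model_class` is expected
# to be the TextEmbeddingModel R6 generator. All values are examples.
build_bert_embedder <- function(model_class, model_dir) {
  model <- model_class$new()  # assumption: no constructor arguments needed
  model$configure(
    model_name = "bert_embedding",
    model_label = "Text embedding via BERT",
    model_language = "english",
    method = "bert",
    ml_framework = "pytorch",
    max_length = 512,  # example value
    chunks = 2,
    overlap = 30,      # example value
    emb_layer_min = "middle",
    emb_layer_max = "2_3_layer",
    emb_pool_type = "average",
    model_dir = model_dir,
    trace = FALSE
  )
  # An object can only be used if is_configured() returns TRUE.
  stopifnot(model$is_configured())
  model
}

# Usage, given the package is attached and a pretrained model is on disk:
# model <- build_bert_embedder(TextEmbeddingModel, "path/to/pretrained_bert")
```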
load_from_disk(): Loads an object from disk and updates the object to the current version of the package.
TextEmbeddingModel$load_from_disk(dir_path)
dir_path: Path where the object is stored.
The method does not return anything. It loads an object from disk.
load(): Method for loading a transformer model into R.
TextEmbeddingModel$load(dir_path)
dir_path: string containing the path to the relevant
model directory.
The method does not return a value. It is used for loading a saved transformer model into the R interface.
save(): Method for saving a transformer model to disk. Relevant only for transformer models.
TextEmbeddingModel$save(dir_path, folder_name)
dir_path: string containing the path to the relevant
model directory.
folder_name: string Name of the folder created within the directory.
This folder contains all model files.
The method does not return a value. It is used for saving a transformer model to disk.
encode(): Method for encoding words of raw texts into integers.
TextEmbeddingModel$encode( raw_text, token_encodings_only = FALSE, to_int = TRUE, trace = FALSE )
raw_text: vector containing the raw texts.
token_encodings_only: bool If TRUE, only the token
encodings are returned. If FALSE, the complete encoding is returned,
which is important for some transformer models.
to_int: bool If TRUE, the integer ids of the tokens are
returned. If FALSE, the tokens are returned. This argument only applies
to transformer models and if token_encodings_only=TRUE.
trace: bool If TRUE, information about the progress
is printed. If FALSE, it is not.
Returns a list containing the integer or token sequences of the raw texts with
special tokens.
decode(): Method for decoding a sequence of integers into tokens.
TextEmbeddingModel$decode(int_seqence, to_token = FALSE)
int_seqence: list containing the integer sequences which
should be transformed to tokens or plain text.
to_token: bool If FALSE, plain text is returned.
If TRUE, a sequence of tokens is returned. This argument is only relevant
if the model is based on a transformer.
Returns a list of token sequences (or plain texts if to_token=FALSE).
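The two methods can be combined into a round trip from raw text to integer ids and back to tokens. The helper below is a hypothetical sketch. It assumes that the list returned by encode with token_encodings_only = TRUE can be passed directly to decode, which the documentation does not state explicitly.

```r
# Sketch: encode raw texts to integer ids and decode them back to tokens.
# `encode_decode_round_trip` is a hypothetical helper; `model` is expected
# to be a configured TextEmbeddingModel based on a transformer.
encode_decode_round_trip <- function(model, texts) {
  ids <- model$encode(
    raw_text = texts,
    token_encodings_only = TRUE,  # only the token encodings
    to_int = TRUE,                # integer ids instead of token strings
    trace = FALSE
  )
  # Note: the argument is spelled `int_seqence` in the method signature.
  tokens <- model$decode(int_seqence = ids, to_token = TRUE)
  list(ids = ids, tokens = tokens)
}

# Usage with a configured model:
# res <- encode_decode_round_trip(model, c("A first text.", "A second text."))
# res$tokens
```

Comparing the decoded tokens with the original text is a quick way to see how the tokenizer splits words and where special tokens are inserted.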
get_special_tokens(): Method for receiving the special tokens of the model.
TextEmbeddingModel$get_special_tokens()
Returns a matrix containing the special tokens in the rows
and their type, token, and id in the columns.
embed(): Method for creating text embeddings from raw texts.
This method should only be used if a small number of texts is to be transformed
into text embeddings. For a large number of texts, use the method embed_large.
If you run out of GPU memory while using 'tensorflow', reduce the
batch size or restart R and switch to CPU-only mode via set_config_cpu_only. In general,
this is not relevant for 'pytorch'.
TextEmbeddingModel$embed( raw_text = NULL, doc_id = NULL, batch_size = 8, trace = FALSE, return_large_dataset = FALSE )
raw_text: vector containing the raw texts.
doc_id: vector containing the corresponding IDs for every text.
batch_size: int determining the maximal size of every batch.
trace: bool TRUE if information about the progress
should be printed to the console.
return_large_dataset: bool If TRUE, the returned object is of class
LargeDataSetForTextEmbeddings. If FALSE, it is of class EmbeddedText.
The method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
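For a handful of texts, a call to embed could be wrapped as in the following sketch. The helper embed_small_batch is hypothetical. The checks on doc_id (one ID per text, no duplicates) are precautions of this sketch, not documented requirements of the method.

```r
# Sketch: embed a small number of raw texts with a configured model.
# `embed_small_batch` is a hypothetical helper, not part of the package.
embed_small_batch <- function(model, texts, ids = seq_along(texts),
                              batch_size = 8) {
  # Precautions of this sketch: one ID per text and no duplicated IDs.
  stopifnot(length(texts) == length(ids))
  stopifnot(anyDuplicated(ids) == 0)
  model$embed(
    raw_text = texts,
    doc_id = ids,
    batch_size = batch_size,      # reduce this if GPU memory runs out
    trace = FALSE,
    return_large_dataset = FALSE  # return an EmbeddedText object
  )
}

# Usage with a configured model:
# embeddings <- embed_small_batch(model, c("First text.", "Second text."),
#                                 ids = c("doc_1", "doc_2"))
```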
embed_large(): Method for creating text embeddings from raw texts.
TextEmbeddingModel$embed_large( large_datas_set, batch_size = 32, trace = FALSE, log_file = NULL, log_write_interval = 2 )
large_datas_set: Object of class LargeDataSetForText containing the raw texts.
batch_size: int determining the maximal size of every batch.
trace: bool TRUE if information about the progress
should be printed to the console.
log_file: string Path to the file where the log should be saved.
If no logging is desired, set this argument to NULL.
log_write_interval: int Time in seconds determining the interval in which
the logger should try to update the log files. Only relevant if log_file is not NULL.
The method returns an object of class LargeDataSetForTextEmbeddings.
fill_mask(): Method for predicting the tokens behind mask tokens.
TextEmbeddingModel$fill_mask(text, n_solutions = 5)
text: string Text containing mask tokens.
n_solutions: int Number of estimated tokens for every mask.
Returns a list containing a data.frame for every
mask. The data.frame contains the solutions in the rows and reports
the score, token id, and token string in the columns.
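The returned list can be reduced to the best candidate per mask, as in the sketch below. The helper best_fill is hypothetical, and the column names "score" and "token_str" are assumptions. The documentation only states that score, token id, and token string are reported, so check the actual column names with names() before relying on them. The example data is a made-up stand-in for a real result.

```r
# Sketch: pick the highest-scoring token for every mask.
# `best_fill` is a hypothetical helper; the default column names are
# assumptions and may differ from the columns the method returns.
best_fill <- function(solutions, score_col = "score", token_col = "token_str") {
  vapply(solutions, function(df) {
    as.character(df[[token_col]][which.max(df[[score_col]])])
  }, character(1))
}

# Made-up stand-in for the result of fill_mask() on a text with two masks.
solutions <- list(
  data.frame(score = c(0.1, 0.7, 0.2), token_str = c("cat", "dog", "bird")),
  data.frame(score = c(0.9, 0.05), token_str = c("runs", "sleeps"))
)
best_fill(solutions)

# Usage with a configured model. The mask token depends on the model and
# can be looked up with get_special_tokens():
# solutions <- model$fill_mask(text = "...", n_solutions = 5)
```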
set_publication_info(): Method for setting the bibliographic information of the model.
TextEmbeddingModel$set_publication_info(type, authors, citation, url = NULL)
type: string Type of information which should be changed/added.
"developer" and "modifier" are possible.
authors: List of people.
citation: string Citation in free text.
url: string Corresponding URL, if applicable.
The method does not return a value. It is used to set the private members for publication information of the model.
get_publication_info(): Method for getting the bibliographic information of the model.
TextEmbeddingModel$get_publication_info()
Returns a list of bibliographic information.
set_model_license(): Method for setting the license of the model.
TextEmbeddingModel$set_model_license(license = "CC BY")
license: string containing the abbreviation of the license or
the license text.
The method does not return a value. It is used for setting the private member for the software license of the model.
get_model_license(): Method for requesting the license of the model.
TextEmbeddingModel$get_model_license()
Returns a string with the license of the model.
set_documentation_license(): Method for setting the license of the model's documentation.
TextEmbeddingModel$set_documentation_license(license = "CC BY")
license: string containing the abbreviation of the license or
the license text.
The method does not return a value. It is used to set the private member for the documentation license of the model.
get_documentation_license(): Method for getting the license of the model's documentation.
TextEmbeddingModel$get_documentation_license()
Returns a string containing the abbreviation of the license or
the license text.
set_model_description(): Method for setting a description of the model.
TextEmbeddingModel$set_model_description( eng = NULL, native = NULL, abstract_eng = NULL, abstract_native = NULL, keywords_eng = NULL, keywords_native = NULL )
eng: string A text describing the training of the classifier,
its theoretical and empirical background, and the different output labels
in English.
native: string A text describing the training of the classifier,
its theoretical and empirical background, and the different output labels
in the native language of the model.
abstract_eng: string A text providing a summary of the description
in English.
abstract_native: string A text providing a summary of the description
in the native language of the model.
keywords_eng: vector of keywords in English.
keywords_native: vector of keywords in the native language of the model.
The method does not return a value. It is used to set the private members for the description of the model.
get_model_description(): Method for requesting the model description.
TextEmbeddingModel$get_model_description()
Returns a list with the description of the model in English
and the native language.
get_model_info(): Method for requesting the model information.
TextEmbeddingModel$get_model_info()
Returns a list of all relevant model information.
get_package_versions(): Method for requesting a summary of the versions of the R and Python packages used for creating the model.
TextEmbeddingModel$get_package_versions()
Returns a list containing the versions of the relevant
R and Python packages.
get_basic_components(): Method for requesting the part of the interface's configuration that is necessary for all models.
TextEmbeddingModel$get_basic_components()
Returns a list.
get_transformer_components(): Method for requesting the part of the interface's configuration that is necessary for transformer models.
TextEmbeddingModel$get_transformer_components()
Returns a list.
get_sustainability_data(): Method for requesting a log of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
TextEmbeddingModel$get_sustainability_data()
Returns a matrix containing the tracked energy consumption,
CO2 equivalents in kg, information on the tracker used, and technical
information on the training infrastructure for every training run.
get_ml_framework(): Method for requesting the machine learning framework used for the model.
TextEmbeddingModel$get_ml_framework()
Returns a string describing the machine learning framework used
for the model.
count_parameter(): Method for counting the trainable parameters of a model.
TextEmbeddingModel$count_parameter(with_head = FALSE)
with_head: bool If TRUE, the number of parameters is returned including
the language modeling head of the model. If FALSE, only the number of parameters of
the core model is returned.
Returns the number of trainable parameters of the model.
is_configured(): Method for checking if the model was successfully configured.
An object can only be used if this value is TRUE.
TextEmbeddingModel$is_configured()
Returns a bool: TRUE if the model is fully configured, FALSE if not.
get_private(): Method for requesting all private fields and methods. Used for loading and updating an object.
TextEmbeddingModel$get_private()
Returns a list with all private fields and methods.
get_all_fields(): Method for requesting all fields of the object.
TextEmbeddingModel$get_all_fields()
The method returns a list containing all public and private fields
of the object.
clone(): The objects of this class are cloneable with this method.
TextEmbeddingModel$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other Text Embedding:
TEFeatureExtractor