| TextEmbeddingModel | R Documentation |
This R6 class stores a text embedding model which can be used to tokenize, encode, decode, and embed
raw texts. The object provides a unique interface for different text processing methods.
Objects of class TextEmbeddingModel transform raw texts into numerical representations which can be used for downstream tasks. To this end, objects of this class can tokenize raw texts, encode tokens into sequences of integers, and decode sequences of integers back into tokens.
aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> TextEmbeddingModel
BaseModel ('BaseModelCore') Object of class BaseModelCore.
Inherited methods:
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEBaseModel$count_parameter()
configure() Method for creating a new text embedding model.
TextEmbeddingModel$configure( model_name = NULL, model_label = NULL, model_language = NULL, max_length = 512L, chunks = 2L, overlap = 0L, emb_layer_min = 1L, emb_layer_max = 2L, emb_pool_type = "Average", pad_value = -100L, base_model = NULL )
model_name string Name of the new model. Please refer to common naming conventions.
Free text can be used with the parameter model_label. If set to NULL, a unique ID
is generated automatically. Allowed values: any
model_label string Label for the new model. Here you can use free text. Allowed values: any
model_language string Languages that the model can work with. Allowed values: any
max_length int Maximal number of tokens per chunk. Must be equal to or lower
than the maximal positional embeddings of the model. Allowed values: 20 <= x
chunks int Maximal number of chunks. Allowed values: 2 <= x
overlap int Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x
emb_layer_min int Minimal layer from which the embeddings should be calculated. Allowed values: 1 <= x
emb_layer_max int Maximal layer from which the embeddings should be calculated. Allowed values: 1 <= x
emb_pool_type string Method to summarize the embeddings of single tokens into a text embedding.
In the case of 'CLS', all cls-tokens between emb_layer_min and emb_layer_max are averaged.
In the case of 'Average', the embeddings of all tokens are averaged.
Please note that BaseModelFunnel allows only 'CLS'. Allowed values: 'CLS', 'Average'
pad_value int Value indicating padding. This value should not be in the range of
regular values used for computations. Thus it is not recommended to change this value.
Default is -100. Allowed values: x <= -1
base_model BaseModelCore Base model for processing raw texts.
trace bool TRUE if information about the estimation phase should be printed to the console.
Method does not return a value.
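A minimal sketch of a configuration call is shown below. It assumes the instance is created with TextEmbeddingModel$new() and that my_base_model stands for a BaseModelCore object created or loaded elsewhere; the names, language, and layer range are purely illustrative and must match the actual base model.

# Illustrative sketch: configure a new text embedding model.
# 'my_base_model' is a hypothetical BaseModelCore object prepared elsewhere.
embedding_model <- TextEmbeddingModel$new()
embedding_model$configure(
  model_name = "my_bert_embeddings",
  model_label = "BERT-based text embeddings",
  model_language = "english",
  max_length = 256L,
  chunks = 4L,
  overlap = 30L,
  emb_layer_min = 11L,   # layer range must exist in the base model
  emb_layer_max = 12L,
  emb_pool_type = "Average",
  pad_value = -100L,
  base_model = my_base_model
)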
load_from_disk() Loads an object from disk and updates the object to the current version of the package.
TextEmbeddingModel$load_from_disk(dir_path)
dir_path Path where the object set is stored.
Method does not return a value. It loads an object from disk.
save() Method for saving a model on disk.
TextEmbeddingModel$save(dir_path, folder_name)
dir_path Path to the directory where the object should be saved.
folder_name string Name of the folder where the model should be saved. Allowed values: any
Method does not return a value. It is used to save an object on disk.
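The following sketch shows saving a configured model and restoring it later with load_from_disk(). Paths and folder names are placeholders, and the assumption that the folder written by save() can be passed directly to load_from_disk() is illustrative only.

# Illustrative sketch: save the model and restore it later.
embedding_model$save(
  dir_path = "models",
  folder_name = "my_bert_embeddings"
)

# Restore from the saved folder (directory layout assumed for illustration).
restored_model <- TextEmbeddingModel$new()
restored_model$load_from_disk(dir_path = "models/my_bert_embeddings")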
encode() Method for encoding words of raw texts into integers.
TextEmbeddingModel$encode( raw_text, token_encodings_only = FALSE, token_to_int = TRUE, trace = FALSE )
raw_text vector Raw text.
token_encodings_only bool
TRUE: Returns a list containing only the tokens.
FALSE: Returns a list containing a list for the tokens, the number of chunks, and
the potential number of chunks for each document/text.
token_to_int bool
TRUE: Returns the tokens as integer indices.
FALSE: Returns the tokens as strings.
trace bool TRUE if information about the estimation phase should be printed to the console.
Returns a list containing the integer or token sequences of the raw texts with
special tokens.
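A short, illustrative call to encode(); the example texts are arbitrary and embedding_model is the configured model from the sketches above.

# Illustrative sketch: encode raw texts into integer sequences.
encodings <- embedding_model$encode(
  raw_text = c("The weather is nice today.", "Embeddings turn text into numbers."),
  token_encodings_only = TRUE,
  token_to_int = TRUE
)

# The same call with token_to_int = FALSE returns the tokens as strings.
tokens <- embedding_model$encode(
  raw_text = c("The weather is nice today."),
  token_encodings_only = TRUE,
  token_to_int = FALSE
)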
decode() Method for decoding a sequence of integers into tokens.
TextEmbeddingModel$decode(int_seqence, to_token = FALSE)
int_seqence list List of integer sequences that should be converted to tokens.
to_token bool
FALSE: Transforms the integers to plain text.
TRUE: Transforms the integers to a sequence of tokens.
Returns a list of token sequences.
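A sketch of decoding that reuses the integer sequences from the encode() example above; it assumes that the list returned by encode() with token_to_int = TRUE can be passed on as int_seqence, which may require adapting to the exact structure the method expects.

# Illustrative sketch: decode integer sequences back to text or tokens.
decoded_text   <- embedding_model$decode(int_seqence = encodings, to_token = FALSE)
decoded_tokens <- embedding_model$decode(int_seqence = encodings, to_token = TRUE)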
embed() Method for creating text embeddings from raw texts.
This method should only be used for transforming a small number of texts
into text embeddings. For a large number of texts, please use the method embed_large().
TextEmbeddingModel$embed( raw_text = NULL, doc_id = NULL, batch_size = 8L, trace = FALSE, return_large_dataset = FALSE )
raw_text vector Raw text.
doc_id vector ID for every text.
batch_size int Size of the batches. Allowed values: 1 <= x
trace bool TRUE if information about the estimation phase should be printed to the console.
return_large_dataset bool If TRUE, a LargeDataSetForTextEmbeddings is returned. If FALSE, an object of class EmbeddedText is returned.
Method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
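An illustrative call to embed() for a handful of documents; the texts and IDs are placeholders.

# Illustrative sketch: embed a small number of texts directly.
texts <- c("First short document.", "Second short document.")
embeddings <- embedding_model$embed(
  raw_text = texts,
  doc_id = c("doc_1", "doc_2"),
  batch_size = 2L
)
# 'embeddings' is an EmbeddedText object; with return_large_dataset = TRUE
# a LargeDataSetForTextEmbeddings would be returned instead.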
embed_large() Method for creating text embeddings from raw texts.
TextEmbeddingModel$embed_large( text_dataset, batch_size = 32L, trace = FALSE, log_file = NULL, log_write_interval = 2L )
text_dataset LargeDataSetForText Object storing textual data.
batch_size int Size of the batches. Allowed values: 1 <= x
trace bool TRUE if information about the estimation phase should be printed to the console.
log_file string Path to the file where the log files should be saved.
If no logging is desired, set this argument to NULL. Allowed values: any
log_write_interval int Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_file is not NULL. Allowed values: 1 <= x
Method returns an object of class LargeDataSetForTextEmbeddings.
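A sketch for larger collections; my_text_dataset stands for a LargeDataSetForText object created elsewhere and is purely illustrative.

# Illustrative sketch: embed a large collection of texts.
# 'my_text_dataset' is a hypothetical LargeDataSetForText object.
large_embeddings <- embedding_model$embed_large(
  text_dataset = my_text_dataset,
  batch_size = 32L,
  log_file = NULL
)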
get_n_features() Method for requesting the number of features.
TextEmbeddingModel$get_n_features()
Returns a double which represents the number of features. This number represents the
hidden size of the embeddings for every chunk (time step).
get_pad_value() Value for indicating padding.
TextEmbeddingModel$get_pad_value()
Returns an int describing the value used for padding.
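Both getters can be used, for instance, to inspect a model before passing its embeddings to a downstream classifier; the following lines are purely illustrative.

# Illustrative sketch: query basic properties of the embeddings.
n_features <- embedding_model$get_n_features()  # hidden size per chunk
pad_value  <- embedding_model$get_pad_value()   # value used for padding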
set_publication_info() Method for setting the bibliographic information of the model.
TextEmbeddingModel$set_publication_info(type, authors, citation, url = NULL)
type string Type of information which should be changed/added.
Possible values are 'developer' and 'modifier'.
authors List of people.
citation string Citation in free text.
url string Corresponding URL if applicable.
Function does not return a value. It is used to set the private members for publication information of the model.
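An illustrative call; the assumption that authors expects a list of person objects (as created with utils::person()) is not confirmed by this page, and the citation and URL are placeholders.

# Illustrative sketch: record bibliographic information about the developers.
embedding_model$set_publication_info(
  type = "developer",
  authors = list(person(given = "Jane", family = "Doe")),
  citation = "Doe, J. (2025). Example text embedding model.",
  url = "https://example.org/my_bert_embeddings"
)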
get_sustainability_data() Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
TextEmbeddingModel$get_sustainability_data(track_mode = "training")
track_mode string Determines the step to which the data refer. Allowed values: 'training', 'inference'
Returns a list containing the tracked energy consumption, CO2 equivalents in kg, information on the
tracker used, and technical information on the training infrastructure.
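For example, if sustainability tracking was enabled during training, the summary could be retrieved as sketched below.

# Illustrative sketch: inspect tracked energy consumption from training.
sustainability_training <- embedding_model$get_sustainability_data(track_mode = "training")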
estimate_sustainability_inference_embed() Calculates the energy consumption for inference of the given task.
TextEmbeddingModel$estimate_sustainability_inference_embed( text_dataset = NULL, batch_size = 32L, sustain_iso_code = NULL, sustain_region = NULL, sustain_interval = 10L, sustain_log_level = "warning", trace = TRUE )
text_dataset LargeDataSetForText Object storing textual data.
batch_size int Size of the batches. Allowed values: 1 <= x
sustain_iso_code string ISO code (Alpha-3 code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region string Region within a country. Only available for the USA and Canada. See the documentation of
codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any
sustain_interval int Interval in seconds for measuring power usage. Allowed values: 1 <= x
sustain_log_level string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
trace bool TRUE if information about the estimation phase should be printed to the console.
Returns nothing. Method saves the statistics internally.
The statistics can be accessed with the method get_sustainability_data("inference")
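A sketch combining the estimation with the subsequent lookup; my_text_dataset and the ISO code are placeholders.

# Illustrative sketch: estimate energy use for embedding a dataset,
# then read the stored statistics.
embedding_model$estimate_sustainability_inference_embed(
  text_dataset = my_text_dataset,  # hypothetical LargeDataSetForText object
  batch_size = 32L,
  sustain_iso_code = "DEU",
  sustain_interval = 10L,
  trace = TRUE
)
embedding_model$get_sustainability_data(track_mode = "inference")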
clone() The objects of this class are cloneable with this method.
TextEmbeddingModel$clone(deep = FALSE)
deep Whether to make a deep clone.
Other Text Embedding:
TEFeatureExtractor