textEmbed: Extract layers and aggregate them to word embeddings, for all character variables in a given dataframe

View source: R/1_1_textEmbed.R


Extract layers and aggregate them to word embeddings, for all character variables in a given dataframe.

Description

Extract layers and aggregate them to word embeddings, for all character variables in a given dataframe.

Usage

textEmbed(
  texts,
  model = "bert-base-uncased",
  layers = -2,
  dim_name = TRUE,
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts = "mean",
  aggregation_from_tokens_to_word_types = NULL,
  keep_token_embeddings = TRUE,
  tokens_select = NULL,
  tokens_deselect = NULL,
  decontextualize = FALSE,
  model_max_length = NULL,
  max_token_to_sentence = 4,
  tokenizer_parallelism = FALSE,
  device = "cpu",
  logging_level = "error",
  ...
)

Arguments

texts

A character variable or a tibble/dataframe with at least one character variable.

model

Character string specifying a pre-trained language model (default 'bert-base-uncased'). For the full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base". Only load models that you trust from HuggingFace; loading a malicious model can execute arbitrary code on your computer.
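
For example, a minimal sketch of embedding two short texts with a multilingual model (the model is downloaded from HuggingFace on first use; the texts here are illustrative):

# Embed two short texts with a multilingual BERT model.
embeddings_ml <- textEmbed(
  texts = c("hello there", "bonjour"),
  model = "bert-base-multilingual-cased"
)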

layers

(string or numeric) Specify the layers that should be extracted (default -2, which gives the second-to-last layer). It is more efficient to only extract the layers that you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all layers by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., not comprising hidden states) and thus should normally not be used. The extracted layers can then be aggregated with the textEmbedLayerAggregation function.
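
For instance, a sketch extracting only the last two layers of a standard 12-layer bert-base model:

# Extract layers 11 and 12 only (more efficient than "all").
embeddings_last_two <- textEmbed(
  texts = "I feel great",
  layers = 11:12,
  aggregation_from_layers_to_tokens = "concatenate"
)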

dim_name

(boolean) If TRUE, append the text-variable name to each dimension name in the output (e.g., Dim1_text_variable_name); this differentiates the word embedding dimension names of different variables. See textDimName to change names back and forth.
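
For example, a minimal sketch producing plain dimension names instead:

# With dim_name = FALSE the dimensions are named Dim1, Dim2, ...
# rather than Dim1_<variable_name>, Dim2_<variable_name>, ...
embeddings_plain <- textEmbed(
  texts = "I feel great",
  dim_name = FALSE
)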

aggregation_from_layers_to_tokens

(string) Method to aggregate the contextualized layers of each token: "mean", "min", or "max" take the mean, minimum, or maximum, respectively, across each column; "concatenate" links the layers of each word embedding together into one long row.
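
As an illustration, "concatenate" and "mean" differ in the width of the resulting token embeddings; a sketch, assuming a 768-dimensional bert-base model:

# "concatenate" over layers 11:12 gives 2 x 768 = 1536 dimensions per token;
# "mean" over the same layers gives 768.
emb_concat <- textEmbed("I feel great",
  layers = 11:12,
  aggregation_from_layers_to_tokens = "concatenate"
)
emb_mean <- textEmbed("I feel great",
  layers = 11:12,
  aggregation_from_layers_to_tokens = "mean"
)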

aggregation_from_tokens_to_texts

(string) Method to aggregate the token embeddings to the individual text (i.e., aggregating across all tokens/words given to the transformer); see the sketch under aggregation_from_tokens_to_word_types below.

aggregation_from_tokens_to_word_types

(string) Method to aggregate the token embeddings to the word type (i.e., one embedding per individual word) rather than per text.
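
A sketch combining text-level and word-type aggregation (mirroring the call in the Examples below):

# emb$texts then holds one embedding per text, and emb$word_types one
# embedding per unique word (e.g., "happy" averaged over its occurrences).
emb <- textEmbed(
  texts = "happy happy harmony",
  aggregation_from_tokens_to_texts = "mean",
  aggregation_from_tokens_to_word_types = "mean"
)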

keep_token_embeddings

(boolean) Whether to also keep token embeddings when using text or word-type aggregation.

tokens_select

Option to select only the word embeddings linked to specific tokens, such as [CLS] and [SEP], for the context embeddings (see the sketch under tokens_deselect below).

tokens_deselect

Option to deselect the embeddings linked to specific tokens, such as [CLS] and [SEP], for the context embeddings.
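
For instance, a sketch excluding the special tokens before aggregating (assuming tokens_deselect accepts a character vector of tokens):

# Drop [CLS] and [SEP] so only word tokens enter the text-level mean.
emb_no_special <- textEmbed(
  texts = "I feel great",
  tokens_deselect = c("[CLS]", "[SEP]"),
  aggregation_from_tokens_to_texts = "mean"
)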

decontextualize

(boolean) Provide word embeddings of single words as input to the model (these embeddings are, e.g., used for plotting; the default, FALSE, uses contextualized embeddings). If using this, then set single_context_embeddings to FALSE.

model_max_length

The maximum length (in number of tokens) for inputs to the transformer model (defaults to the value stored for the associated model).

max_token_to_sentence

(numeric) Maximum number of tokens in a string to handle before switching to embedding text sentence by sentence.

tokenizer_parallelism

(boolean) If TRUE, turn on tokenizer parallelism (default FALSE).

device

Name of device to use: 'cpu', 'gpu', 'gpu:k', or 'mps'/'mps:k' for macOS, where k is a specific device number such as 'mps:1'.
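
A sketch of selecting a device, here Apple-silicon GPU acceleration (assuming an MPS device is available):

# Use the Metal Performance Shaders backend on macOS; fall back to
# device = "cpu" if no such device is available.
emb_mps <- textEmbed("I feel great", device = "mps")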

logging_level

Set the logging level. Default: "error" (as shown in Usage). Options (ordered from less logging to more logging): "critical", "error", "warning", "info", "debug".

...

Additional settings passed on to textEmbedRawLayers().

Value

A list of tibbles with tokens, a column identifying the layer, and word embeddings (see the Examples for the returned $tokens, $texts, and $word_types elements). Note that layer 0 is the input embedding to the transformer.
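
A sketch of inspecting the returned object ($word_types is present when aggregation_from_tokens_to_word_types is set):

# Top-level structure: token-, text-, and word-type-level embeddings.
emb <- textEmbed("I feel great",
  aggregation_from_tokens_to_word_types = "mean"
)
str(emb, max.level = 1)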

See Also

See textEmbedLayerAggregation, textEmbedRawLayers, and textDimName.

Examples


# word_embeddings <- textEmbed(Language_based_assessment_data_8[1:2, 1:2],
#                             layers = 10:11,
#                             aggregation_from_layers_to_tokens = "concatenate",
#                             aggregation_from_tokens_to_texts = "mean",
#                             aggregation_from_tokens_to_word_types = "mean")
## Show information about how the embeddings were constructed
# comment(word_embeddings$texts$satisfactiontexts)
# comment(word_embeddings$word_types)
# comment(word_embeddings$tokens$satisfactiontexts)

