View source: R/7_1_textTopics.R
textTopics    R Documentation
textTopics() trains a BERTopic model (via the bertopic Python package) on a
text variable in a tibble/data.frame. The function embeds documents, reduces
dimensionality (UMAP), clusters documents (HDBSCAN), and extracts topic representations
using c-TF-IDF with optional KeyBERT/MMR-based representation. (EXPERIMENTAL)
Usage

textTopics(
data,
variable_name,
embedding_model = "distilroberta",
representation_model = c("mmr", "keybert"),
umap_n_neighbors = 15L,
umap_n_components = 5L,
umap_min_dist = 0,
umap_metric = "cosine",
hdbscan_min_cluster_size = 5L,
hdbscan_min_samples = NULL,
hdbscan_metric = "euclidean",
hdbscan_cluster_selection_method = "eom",
hdbscan_prediction_data = TRUE,
num_top_words = 10L,
n_gram_range = c(1L, 3L),
stopwords = "english",
min_df = 5L,
bm25_weighting = FALSE,
reduce_frequent_words = TRUE,
set_seed = 8L,
save_dir
)
Arguments

data
A tibble or data.frame containing the text variable to model.

variable_name
A character string giving the name of the text variable in data.

embedding_model
A character string specifying which embedding model to use. The default is "distilroberta".

representation_model
A character string specifying the topic representation method. Must be one of "mmr" or "keybert".

umap_n_neighbors
Integer. Number of neighbors used by UMAP to balance local versus global structure. Smaller values emphasize local clusters; larger values emphasize global structure.

umap_n_components
Integer. Number of dimensions to reduce to with UMAP (the embedding space used for clustering).

umap_min_dist
Numeric. Minimum distance between embedded points in UMAP. Smaller values typically yield tighter clusters.

umap_metric
Character string specifying the distance metric used by UMAP, e.g. "cosine" (the default).

hdbscan_min_cluster_size
Integer. The minimum cluster size for HDBSCAN. Larger values yield fewer, broader topics; smaller values yield more, finer-grained topics.

hdbscan_min_samples
Integer or NULL. The number of samples in a neighborhood for a point to be considered a core point; if NULL, HDBSCAN uses hdbscan_min_cluster_size. Larger values make the clustering more conservative.

hdbscan_metric
Character string specifying the distance metric used by HDBSCAN, typically "euclidean" (the default).

hdbscan_cluster_selection_method
Character string specifying the cluster selection strategy. Either "eom" (excess of mass, the default) or "leaf".

hdbscan_prediction_data
Logical. If TRUE, retain the data needed to predict topics for new documents.

num_top_words
Integer. Number of top terms to return per topic.

n_gram_range
Integer vector of length 2 giving the min and max n-gram length used by the vectorizer (e.g., c(1L, 3L), the default).

stopwords
Character string naming the stopword dictionary to use (e.g. "english", the default).

min_df
Integer. Minimum document frequency for terms included in the vectorizer.

bm25_weighting
Logical. If TRUE, use BM25 weighting instead of the standard c-TF-IDF term weighting.

reduce_frequent_words
Logical. If TRUE, down-weight very frequent words in the c-TF-IDF representation.

set_seed
Integer. Random seed used to initialize UMAP (and other stochastic components) for reproducibility.

save_dir
Character string specifying the directory where outputs should be saved. A folder will be created (or reused) to store the fitted model and derived outputs.
Details

Typical tuning levers:

- More topics / finer clusters: decrease hdbscan_min_cluster_size, decrease umap_n_neighbors, and/or increase umap_n_components.
- Fewer topics / broader clusters: increase hdbscan_min_cluster_size and/or increase umap_n_neighbors.
- More phrase-like terms: increase the n_gram_range maximum (e.g., up to 3).
- Cleaner vocabulary: increase min_df, and use reduce_frequent_words = TRUE.
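As an illustrative sketch of these levers (using the example dataset from the Examples section below, and assuming the text package and its Python backend are installed), finer- and broader-grained runs might look like:

```r
library(text)

# Finer-grained topics: smaller minimum clusters, more local UMAP structure.
finer <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  hdbscan_min_cluster_size = 3,
  umap_n_neighbors = 10,
  umap_n_components = 8,
  save_dir = "bertopic_finer"     # illustrative directory name
)

# Fewer, broader topics: larger minimum clusters, more global UMAP structure.
broader <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  hdbscan_min_cluster_size = 15,
  umap_n_neighbors = 30,
  save_dir = "bertopic_broader"   # illustrative directory name
)
```

Because HDBSCAN and UMAP interact, it is usually worth changing one lever at a time and comparing the resulting topic counts and sizes.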
Value

A named list containing:

- The training data used to fit the model (or loaded from disk if available).
- A document-by-topic matrix of normalized topic mixtures (LDA-like). Rows typically sum to 1; rows of zeros can occur if no topic mass was assigned.
- Document-level outputs, including hard topic labels (-1 indicates outliers).
- Topic-level outputs, including topic sizes and top terms.
- The fitted BERTopic model object (Python-backed).
- The model identifier (currently "bert_topic").
- The random seed used.
- The directory where artifacts were saved.
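A quick way to see which list components a fitted result contains (a sketch only; `res` is assumed to come from a prior textTopics() call such as the one in the Examples section):

```r
library(text)

# `res` is assumed to be the list returned by a previous textTopics() call.
names(res)               # names of the components described above
str(res, max.level = 1)  # one-line overview of each component
```

Inspecting `names(res)` first avoids guessing component names, which may differ across package versions.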
See Also

textTopicsReduce, textTopicsTest, textTopicsWordcloud
Examples

## Not run:
res <- textTopics(
data = Language_based_assessment_data_8,
variable_name = "harmonytexts",
embedding_model = "distilroberta",
representation_model = "mmr",
min_df = 3,
save_dir = "bertopic_results"
)
## End(Not run)