View source: R/7_1_textTopics.R
textTopics    R Documentation
textTopics() trains a BERTopic model (via the bertopic Python package) on a
text variable in a tibble/data.frame. The function embeds documents, reduces
dimensionality (UMAP), clusters documents (HDBSCAN), and extracts topic representations
using c-TF-IDF with optional KeyBERT/MMR-based representation. (EXPERIMENTAL)
Usage

textTopics(
data,
variable_name,
embedding_model = "distilroberta",
representation_model = c("mmr", "keybert"),
umap_n_neighbors = 15L,
umap_n_components = 5L,
umap_min_dist = 0,
umap_metric = "cosine",
hdbscan_min_cluster_size = 5L,
hdbscan_min_samples = NULL,
hdbscan_metric = "euclidean",
hdbscan_cluster_selection_method = "eom",
hdbscan_prediction_data = TRUE,
num_top_words = 10L,
n_gram_range = c(1L, 3L),
stopwords = "english",
min_df = 5L,
bm25_weighting = FALSE,
reduce_frequent_words = TRUE,
set_seed = 8L,
save_dir
)
Arguments

data
A tibble or data.frame containing the text variable to model.

variable_name
A character string giving the name of the text variable in data.

embedding_model
A character string specifying which embedding model to use. The default is "distilroberta".

representation_model
A character string specifying the topic representation method. Must be one of "mmr" or "keybert".

umap_n_neighbors
Integer. Number of neighbors used by UMAP to balance local versus global structure. Smaller values emphasize local clusters; larger values emphasize global structure.

umap_n_components
Integer. Number of dimensions to reduce to with UMAP (the embedding space used for clustering).

umap_min_dist
Numeric. Minimum distance between embedded points in UMAP. Smaller values typically yield tighter clusters.

umap_metric
Character string specifying the distance metric used by UMAP, e.g. "cosine" (the default).

hdbscan_min_cluster_size
Integer. The minimum cluster size for HDBSCAN. Larger values yield fewer, broader topics; smaller values yield more, finer-grained topics.

hdbscan_min_samples
Integer or NULL. The number of samples in a neighborhood for a point to be considered a core point; if NULL, HDBSCAN uses hdbscan_min_cluster_size. Larger values make the clustering more conservative.

hdbscan_metric
Character string specifying the distance metric used by HDBSCAN, typically "euclidean" (the default).

hdbscan_cluster_selection_method
Character string specifying the cluster selection strategy. Either "eom" (excess of mass, the default) or "leaf".

hdbscan_prediction_data
Logical. If TRUE, retain the data needed to predict topics for new documents.

num_top_words
Integer. Number of top terms to return per topic.

n_gram_range
Integer vector of length 2 giving the min and max n-gram length used by the vectorizer (e.g., c(1L, 3L), the default).

stopwords
Character string naming the stopword dictionary to use (e.g. "english", the default).

min_df
Integer. Minimum document frequency for terms included in the vectorizer.

bm25_weighting
Logical. If TRUE, use BM25 weighting instead of the standard c-TF-IDF term weighting.

reduce_frequent_words
Logical. If TRUE, down-weight very frequent words in the c-TF-IDF representation.

set_seed
Integer. Random seed used to initialize UMAP (and other stochastic components) for reproducibility.

save_dir
Character string specifying the directory where outputs should be saved. A folder will be created (or reused) to store the fitted model and derived outputs.
Details

Typical tuning levers:

- More topics / finer clusters: decrease hdbscan_min_cluster_size, decrease umap_n_neighbors, and/or increase umap_n_components.
- Fewer topics / broader clusters: increase hdbscan_min_cluster_size and/or increase umap_n_neighbors.
- More phrase-like terms: increase the n_gram_range maximum (e.g., up to 3).
- Cleaner vocabulary: increase min_df, and use reduce_frequent_words = TRUE.
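As an illustrative sketch of these levers (using the example dataset from the Examples section below, and assuming the text package and its Python backend are installed), finer- and broader-grained runs might look like:

```r
library(text)

# Finer-grained topics: smaller minimum clusters, more local UMAP structure.
finer <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  hdbscan_min_cluster_size = 3,
  umap_n_neighbors = 10,
  umap_n_components = 8,
  save_dir = "bertopic_finer"     # illustrative directory name
)

# Fewer, broader topics: larger minimum clusters, more global UMAP structure.
broader <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  hdbscan_min_cluster_size = 15,
  umap_n_neighbors = 30,
  save_dir = "bertopic_broader"   # illustrative directory name
)
```

Because HDBSCAN and UMAP interact, it is usually worth changing one lever at a time and comparing the resulting topic counts and sizes.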
Value

A named list containing:

- The training data used to fit the model (or loaded from disk if available).
- A document-by-topic matrix of normalized topic mixtures (LDA-like). Rows typically sum to 1; rows of zeros can occur if no topic mass was assigned.
- Document-level outputs, including hard topic labels (-1 indicates outliers).
- Topic-level outputs, including topic sizes and top terms.
- The fitted BERTopic model object (Python-backed).
- The model identifier (currently "bert_topic").
- The random seed used.
- The directory where artifacts were saved.
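A quick way to see which list components a fitted result contains (a sketch only; `res` is assumed to come from a prior textTopics() call such as the one in the Examples section):

```r
library(text)

# `res` is assumed to be the list returned by a previous textTopics() call.
names(res)               # names of the components described above
str(res, max.level = 1)  # one-line overview of each component
```

Inspecting `names(res)` first avoids guessing component names, which may differ across package versions.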
See Also

textTopicsReduce, textTopicsTest, textTopicsWordcloud
Examples

## Not run:
res <- textTopics(
data = Language_based_assessment_data_8,
variable_name = "harmonytexts",
embedding_model = "distilroberta",
representation_model = "mmr",
min_df = 3,
save_dir = "bertopic_results"
)
## End(Not run)