train_tfidf_model | R Documentation
Train a TF-IDF model with customizable tokenization and vocabulary pruning.
train_tfidf_model(
  preprocessed_text,
  max_features = 10000,
  min_df = 2,
  max_df = 0.8
)
preprocessed_text | A character vector containing the preprocessed text.
max_features | The maximum number of features (terms) to include in the vocabulary. Default is 10000.
min_df | Minimum document frequency for terms. Default is 2 (terms must appear in at least 2 documents).
max_df | Maximum document frequency as a proportion of documents. Default is 0.8 (terms must appear in less than 80% of documents).
This function performs the following steps:
1. Tokenizes the preprocessed text into words and removes stopwords.
2. Defines custom stopwords and retains important emotional function words.
3. Creates a vocabulary based on unigrams and trigrams, pruning terms based on document frequency and term count.
4. Builds the TF-IDF sparse matrix for the input text.
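To illustrate how min_df and max_df interact in the pruning step, here is a minimal base-R sketch. It is not this function's implementation: `sketch_tfidf` is a hypothetical helper, it skips stopword handling, n-grams, and max_features, it returns a dense matrix, and the unsmoothed weight log(N / df) is an assumed IDF variant.

```r
# Hypothetical helper for illustration only; not part of the documented package.
sketch_tfidf <- function(docs, min_df = 2, max_df = 0.8) {
  n_docs <- length(docs)

  # Lowercase, split on anything that is not a letter or apostrophe,
  # and drop empty strings left by leading delimiters.
  tokens <- lapply(strsplit(tolower(docs), "[^a-z']+"),
                   function(x) x[nzchar(x)])

  # Document frequency: number of documents that contain each term.
  df_tab <- table(unlist(lapply(tokens, unique)))
  doc_freq <- setNames(as.numeric(df_tab), names(df_tab))

  # Keep terms found in at least min_df documents and in less than
  # a max_df proportion of all documents.
  keep <- names(doc_freq)[doc_freq >= min_df & doc_freq / n_docs < max_df]

  # Raw term counts per document over the pruned vocabulary.
  counts <- matrix(0, nrow = n_docs, ncol = length(keep),
                   dimnames = list(NULL, keep))
  for (i in seq_len(n_docs)) {
    counts[i, ] <- as.numeric(table(factor(tokens[[i]], levels = keep)))
  }

  # Assumed IDF variant: log(N / df), no smoothing or normalization.
  idf <- log(n_docs / doc_freq[keep])

  # Scale each column (term) by its IDF weight.
  t(t(counts) * idf)
}

docs <- c("happy happy day", "happy night", "sad night",
          "sad day", "calm evening")
m <- sketch_tfidf(docs)
colnames(m)  # "calm" and "evening" are pruned: each occurs in one document
```

Under the same rule, a corpus whose documents share no terms ends up with an empty vocabulary at the default min_df = 2, which is worth keeping in mind for very small inputs.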
A list with the following components:
The trained TF-IDF model object.
The vocabulary vectorizer used in training.
The TF-IDF sparse matrix representing the text data.
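preprocessed_text <- c("I'm feeling so happy today!", "I feel really excited and hopeful!")
# No term occurs in both documents, so the default min_df = 2 would likely
# prune the entire vocabulary; relax it for this two-document example.
result <- train_tfidf_model(preprocessed_text, min_df = 1)
result$tfidf_model  # Access the trained TF-IDF model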