train_tfidf_model: Train a TF-IDF Model (for Training Phase)
In text2emotion: Emotion Analysis and Emoji Mapping for Text

View source: R/TF-IDF_train.R

Description

Train a TF-IDF model with customizable tokenization and vocabulary pruning.

Usage

train_tfidf_model(
  preprocessed_text,
  max_features = 10000,
  min_df = 2,
  max_df = 0.8
)

Arguments

`preprocessed_text`	A character vector containing the preprocessed text.
`max_features`	The maximum number of features (terms) to include in the vocabulary. Default is 10000.
`min_df`	Minimum document frequency for terms. Default is 2 (terms must appear in at least 2 documents).
`max_df`	Maximum document frequency as a proportion of documents. Default is 0.8 (terms must appear in less than 80% of documents).
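The interaction of the three pruning arguments can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy corpus is hypothetical, and both the inclusive upper bound on `max_df` and the choice of keeping the most frequent terms under `max_features` are assumptions made for illustration.

```r
# Toy corpus: each element is one already-tokenized document (hypothetical data).
docs <- list(
  c("happy", "today", "really"),
  c("happy", "excited", "really"),
  c("sad", "today", "really"),
  c("happy", "hopeful", "really"),
  c("angry", "really")
)

# Document frequency: the number of documents that contain each term.
terms <- unique(unlist(docs))
doc_freq <- vapply(
  terms,
  function(term) sum(vapply(docs, function(d) term %in% d, logical(1))),
  numeric(1)
)
doc_prop <- doc_freq / length(docs)

min_df <- 2        # a term must occur in at least 2 documents
max_df <- 0.8      # and in no more than 80% of documents (inclusive bound assumed)
max_features <- 1  # cap on vocabulary size

# "really" is in every document (too common); "sad", "angry", etc. are too rare.
keep <- doc_freq >= min_df & doc_prop <= max_df
kept <- sort(doc_freq[keep], decreasing = TRUE)

# Apply the size cap, preferring terms found in more documents (an assumption).
vocab <- names(head(kept, max_features))
print(vocab)
```

Here "happy" (3 of 5 documents) and "today" (2 of 5) pass both thresholds, and the cap of one feature then leaves only "happy".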

Details

This function performs the following steps:

1. Tokenizes the preprocessed text into words and removes stopwords.
2. Defines custom stopwords and retains important emotional function words.
3. Creates a vocabulary based on unigrams and trigrams, pruning terms based on document frequency and term count.
4. Builds the TF-IDF sparse matrix for the input text.
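The steps above can be sketched with the text2vec package. This is an assumption: the page does not say which library the function uses internally, and the stopword list, the example documents, and the mapping of `min_df`, `max_df`, and `max_features` onto `prune_vocabulary()` arguments are all hypothetical. Note also that `ngram = c(1L, 3L)` in text2vec covers unigrams through trigrams, so it includes bigrams as well.

```r
library(text2vec)

# Hypothetical, already-preprocessed documents.
docs <- c(
  "i am not happy today",
  "i feel really happy and hopeful",
  "today i am not sad",
  "i feel excited and hopeful today"
)

# Hypothetical stopword list; "not" is taken out so that negation is retained.
base_stopwords <- c("i", "am", "and", "not", "the")
custom_stopwords <- setdiff(base_stopwords, "not")

# Step 1: tokenize into words.
it <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)

# Steps 2-3: build an n-gram vocabulary without stopwords, then prune it.
vocab <- create_vocabulary(it, ngram = c(1L, 3L), stopwords = custom_stopwords)
vocab <- prune_vocabulary(
  vocab,
  doc_count_min = 2,         # assumed counterpart of min_df
  doc_proportion_max = 0.8,  # assumed counterpart of max_df
  vocab_term_max = 10000     # assumed counterpart of max_features
)

# Step 4: document-term matrix, then TF-IDF weighting.
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)
tfidf_model <- TfIdf$new()
tfidf_matrix <- fit_transform(dtm, tfidf_model)

# Bundle the pieces under the component names listed in the Value section.
result <- list(
  tfidf_model = tfidf_model,
  vectorizer = vectorizer,
  tfidf_matrix = tfidf_matrix
)
dim(result$tfidf_matrix)
```

Keeping the fitted model and the vectorizer alongside the matrix allows new text to be projected into the same feature space later, which is presumably why the function returns all three.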

Value

A list with the following components:

tfidf_model: The trained TF-IDF model object.
vectorizer: The vocabulary vectorizer used in training.
tfidf_matrix: The TF-IDF sparse matrix representing the text data.

Examples

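# The two documents below share no content words, so the default min_df = 2
# would likely prune the whole vocabulary; min_df = 1 keeps single-document terms.
preprocessed_text <- c("I'm feeling so happy today!", "I feel really excited and hopeful!")
result <- train_tfidf_model(preprocessed_text, min_df = 1)
result$tfidf_model   # Access the trained TF-IDF model
result$tfidf_matrix  # Access the TF-IDF sparse matrix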

text2emotion documentation built on June 8, 2025, 1:04 p.m.