calc_tfidf_ngrams: Calculate TF-IDFs for unigrams or bigrams
In CDU-data-science-team/experienceAnalysis: Helper Functions for Text Mining


For a given labelled text, return the unigrams or bigrams with the largest TF-IDFs for the given class(es).

calc_tfidf_ngrams(
  x,
  target_col_name,
  text_col_name,
  filter_class = NULL,
  ngrams_type = c("Unigrams", "Bigrams"),
  number_of_ngrams = NULL
)

`x`	A data frame with two columns: the column with the classes; and the column with the text.
`target_col_name`	A string with the column name of the target variable (the classes). Each class is treated as one "document", so this is equivalent to argument `document` in `tidytext::bind_tf_idf`.
`text_col_name`	A string with the column name of the text variable.
`filter_class`	A string or vector of strings with the name(s) of the class(es) for which TF-IDFs are to be calculated. Defaults to `NULL` (all classes).
`ngrams_type`	A string: "Unigrams" for unigrams or "Bigrams" for bigrams.
`number_of_ngrams`	Integer. Number of n-grams to return. Defaults to `NULL` (all n-grams).

A data frame with six columns: class; n-gram (word or bigram); count; term frequency (TF); inverse document frequency (IDF); and TF-IDF.
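
To make the six columns concrete, here is a minimal base R sketch that builds a table of the same shape for a toy two-class corpus. It is not the package's implementation: the column names (`class`, `word`, `n`, `tf`, `idf`, `tf_idf`) and the whitespace tokenisation are assumptions made for illustration, and the formulas follow the definitions used by `tidytext::bind_tf_idf` (TF as a within-document proportion, IDF as a natural log).

```r
# Toy corpus: one string per class ("document"). All names are illustrative.
docs <- c(
  A = "the staff were kind and the ward was clean",
  B = "the food was cold and the wait was long"
)

# Unigram counts per class, in long format (one row per class-word pair)
counts <- do.call(rbind, lapply(names(docs), function(cl) {
  tab <- table(strsplit(docs[[cl]], " ", fixed = TRUE)[[1]])
  data.frame(class = cl, word = names(tab), n = as.integer(tab),
             stringsAsFactors = FALSE)
}))

# Term frequency: count divided by the total number of terms in that class
counts$tf <- counts$n / ave(counts$n, counts$class, FUN = sum)

# Inverse document frequency: ln(number of classes / classes containing the term)
classes_with_term <- table(counts$word)
counts$idf <- log(length(docs) / as.numeric(classes_with_term[counts$word]))

# TF-IDF is the product of the two
counts$tf_idf <- counts$tf * counts$idf

# Highest TF-IDFs first within each class
counts <- counts[order(counts$class, -counts$tf_idf), ]
print(counts, row.names = FALSE)
```

Words that occur in both classes ("the", "and", "was") get an IDF of zero and therefore a TF-IDF of zero, while words unique to one class rise to the top of that class.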

Unlike other functions in experienceAnalysis (e.g. `calc_net_sentiment_nrc`), here it does not make sense to set `target_col_name` to `NULL`. The TF-IDF of an n-gram depends on the number of "documents" containing it (see Silge and Robinson, 2017), so there must be at least two classes (or "documents") to use in the calculations.
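
The requirement for at least two classes follows directly from the IDF formula. As a small sketch, assuming the natural-log definition used by `tidytext::bind_tf_idf` (the helper name `idf` is hypothetical):

```r
# IDF of a term: ln(number of classes / number of classes containing the term)
idf <- function(n_classes, n_classes_with_term) {
  log(n_classes / n_classes_with_term)
}

idf(1, 1)  # a single class: every term that occurs has IDF 0
idf(2, 1)  # two classes, term in only one of them: log(2)
idf(2, 2)  # two classes, term in both: 0
```

With one class, every n-gram that occurs is necessarily in that class, so all IDFs (and hence all TF-IDFs) collapse to zero and carry no information.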

When `filter_class` is not `NULL`, the TF-IDFs are still calculated using all classes/documents and only then filtered by `filter_class`.
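
The order of these two steps matters. The following base R sketch contrasts "calculate, then filter" (the documented behaviour) with "filter, then calculate". The data and the helper `add_tfidf` are hypothetical and only mimic the calculation; they are not taken from the package.

```r
# Hypothetical word counts for three classes
counts <- data.frame(
  class = c("A", "A", "B", "B", "C", "C"),
  word  = c("nurse", "kind", "nurse", "late", "food", "late"),
  n     = c(2, 1, 1, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Illustrative helper: add tf, idf and tf_idf using every class present in `d`
add_tfidf <- function(d) {
  d$tf <- d$n / ave(d$n, d$class, FUN = sum)
  n_classes <- length(unique(d$class))
  d$idf <- log(n_classes / as.numeric(table(d$word)[d$word]))
  d$tf_idf <- d$tf * d$idf
  d
}

# Documented order: calculate over all classes, then keep class "A"
all_then_filter <- subset(add_tfidf(counts), class == "A")

# The other order, for contrast: filtering first leaves a single "document"
filter_then_all <- add_tfidf(subset(counts, class == "A"))

all_then_filter$tf_idf  # non-zero: IDFs reflect classes B and C as well
filter_then_all$tf_idf  # all zero: only one class is left
```

Filtering afterwards keeps the IDFs informed by the classes that were dropped from the output, which is why a single class can be requested without running into the one-document problem.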

Silge J. & Robinson D. (2017). Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly Media. ISBN 978-1-491-98165-8.

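library(experienceAnalysis)
library(magrittr) # Provides %>%, in case it is not already attached

books <- janeaustenr::austen_books() # Jane Austen books (columns: text, book)

# Collapse the lines of each book into a single string.
# Select the text column only: pasting the whole data frame would deparse its columns.
emma <- paste(books$text[books$book == "Emma"], collapse = " ")
pp <- paste(books$text[books$book == "Pride & Prejudice"], collapse = " ")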

# Make data frame with books Emma and Pride & Prejudice
x <- data.frame(
  text = c(emma, pp),
  book = c("Emma", "Pride & Prejudice")
)

# Get a few of the bigram counts, TFs, IDFs and highest TF-IDFs for each book
calc_tfidf_ngrams(
  x,
  target_col_name = "book",
  text_col_name = "text",
  filter_class = NULL,
  ngrams_type = "Bigrams",
  number_of_ngrams = 30
) %>%
  split(.$book)