View source: R/calc_tfidf_ngrams.R
For a given labelled text, return the unigrams or bigrams with the largest TF-IDFs for the given class(es).
calc_tfidf_ngrams(
  x,
  target_col_name,
  text_col_name,
  filter_class = NULL,
  ngrams_type = c("Unigrams", "Bigrams"),
  number_of_ngrams = NULL
)
x | A data frame with two columns: the column with the classes, and the column with the text.
target_col_name | A string with the column name of the target variable. It is equivalent to argument
text_col_name | A string with the column name of the text variable.
filter_class | A string or vector of strings with the name(s) of the class(es) for which TF-IDFs are to be calculated. Defaults to NULL.
ngrams_type | A string. Should be "Unigrams" for unigrams and "Bigrams" for bigrams.
number_of_ngrams | Integer. Number of n-grams to return. Defaults to NULL, which returns all n-grams.
A data frame with six columns: class; n-gram (word or bigram); count; term-frequency; inverse document frequency; and TF-IDF.
Unlike other functions in experienceAnalysis (e.g. calc_net_sentiment_nrc), here it does not make sense to have target_col_name set to NULL: the TF-IDF of an n-gram depends on the number of "documents" containing it (see Silge and Robinson, 2017), so there must be at least two classes (or "documents") to use in the calculations.
When filter_class is not NULL, the TF-IDFs will still be calculated using all classes/documents and then filtered by filter_class.
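To illustrate the two points above, here is a minimal base-R sketch of the TF-IDF definition described in Silge and Robinson (2017): term frequency is an n-gram's count divided by the total count in its class, and inverse document frequency is the natural log of the number of classes over the number of classes containing the n-gram. This is not the package's implementation, and whether calc_tfidf_ngrams computes its scores in exactly this way is an assumption. The helper toy_tfidf and the toy tokens are hypothetical and exist only for this example.

```r
# Hypothetical helper, for illustration only (not part of experienceAnalysis).
# `docs` is a named list: one character vector of tokens per class/"document".
toy_tfidf <- function(docs) {
  # Long table of counts per class and n-gram
  counts <- do.call(rbind, lapply(names(docs), function(d) {
    tab <- table(docs[[d]])
    data.frame(class = d, ngram = names(tab), n = as.integer(tab),
               stringsAsFactors = FALSE)
  }))

  # Term frequency: share of the class's tokens taken by each n-gram
  counts$tf <- counts$n / ave(counts$n, counts$class, FUN = sum)

  # Inverse document frequency: ln(classes / classes containing the n-gram)
  n_docs <- length(unique(counts$class))
  doc_freq <- table(counts$ngram)
  counts$idf <- log(n_docs / as.numeric(doc_freq[counts$ngram]))

  counts$tf_idf <- counts$tf * counts$idf
  counts
}

# Toy tokens, already split into unigrams
docs <- list(
  Emma  = c("emma", "said", "mr", "knightley", "emma"),
  Pride = c("elizabeth", "said", "mr", "darcy")
)

# Scores are computed on all classes first, then filtered to one class
all_classes <- toy_tfidf(docs)
emma_only <- all_classes[all_classes$class == "Emma", ]
emma_only[order(-emma_only$tf_idf), ]

# With a single class, idf = log(1 / 1) = 0 for every n-gram,
# so every TF-IDF is zero and carries no information
single <- toy_tfidf(docs["Emma"])
all(single$tf_idf == 0)
```

In the two-class case, n-grams shared by both classes ("said", "mr") get an IDF of zero, while n-grams unique to one class get a positive score. Filtering before computing the scores would collapse everything to the single-class case, which is why the filter is applied last.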
Silge J. & Robinson D. (2017). Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly Media. ISBN 978-1-491-98165-8.
library(experienceAnalysis)
library(magrittr) # Provides the %>% pipe used below

books <- janeaustenr::austen_books() # Jane Austen books

# Collapse the lines of each book into a single string
emma <- paste(books$text[books$book == "Emma"], collapse = " ")
pp <- paste(books$text[books$book == "Pride & Prejudice"], collapse = " ")

# Make data frame with books Emma and Pride & Prejudice
x <- data.frame(
  text = c(emma, pp),
  book = c("Emma", "Pride & Prejudice")
)

# Get a few of the bigram counts, TFs, IDFs and highest TF-IDFs for each book
calc_tfidf_ngrams(x, target_col_name = "book", text_col_name = "text",
  filter_class = NULL,
  ngrams_type = "Bigrams",
  number_of_ngrams = 30
) %>%
  split(.$book)