filter_tf_idf: Remove Words Below a TF-IDF Threshold from a...

Description

Remove words from a TermDocumentMatrix or DocumentTermMatrix that do not meet a tf-idf threshold. The code is based on Gruen & Hornik's (2011) code but allows for easier chaining and extends the filtering to a TermDocumentMatrix. This can be used to remove words that appear so frequently across a corpus that they carry little information.

Usage

filter_tf_idf(x, min = NULL, verbose = FALSE)

Arguments

x

A TermDocumentMatrix or DocumentTermMatrix.

min

The minimum threshold that a word's tf-idf must exceed. If min = NULL, the median of the tf-idf values is used.

verbose

logical. If TRUE, the summary statistics of the tf-idf values are printed. This can be useful for exploration and for choosing the min value.
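
To make the role of min concrete, here is a minimal base R sketch of tf-idf filtering on a toy document-term matrix. It is not the package's implementation: the scoring formula (mean length-normalized term frequency times a log2 inverse document frequency) is an assumption modeled on the approach in the cited topicmodels paper, the strict comparison follows the wording "must exceed", and the names dtm, tf_idf, and min_tfidf are hypothetical.

```r
# Toy document-term matrix: rows are documents, columns are terms (counts).
dtm <- matrix(
    c(2, 1, 0, 1,
      1, 0, 3, 0,
      1, 2, 0, 1),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("doc1", "doc2", "doc3"), c("the", "tax", "jobs", "plan"))
)

# Term frequency normalized by document length.
tf <- dtm / rowSums(dtm)

# Number of documents each term occurs in.
doc_freq <- colSums(dtm > 0)

# Assumed score: mean normalized tf over the documents containing the term,
# weighted by log2 inverse document frequency.
tf_idf <- (colSums(tf) / doc_freq) * log2(nrow(dtm) / doc_freq)

# Equivalent of min = NULL: fall back to the median score.
min_tfidf <- median(tf_idf)

# Keep only the terms whose score exceeds the threshold.
filtered <- dtm[, tf_idf > min_tfidf, drop = FALSE]

print(summary(tf_idf))   # similar in spirit to verbose = TRUE
print(colnames(filtered))
```

In this toy matrix "the" occurs in every document, so its inverse document frequency, and therefore its score, is zero and it is dropped. That is the sense in which words that appear too frequently carry little information.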

Value

Returns a TermDocumentMatrix or DocumentTermMatrix.

Author(s)

Bettina Grün, Kurt Hornik, and Tyler Rinker <tyler.rinker@gmail.com>.

References

Bettina Gruen & Kurt Hornik (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30. http://www.jstatsoft.org/article/view/v040i13/v40i13.pdf

Examples

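library(gofastr)

# Build a DocumentTermMatrix with one document per person/time pairing
(x <- with(presidential_debates_2012, q_dtm(dialogue, paste(person, time, sep = "_"))))
# Default threshold: the median tf-idf
filter_tf_idf(x)
# Explicit threshold
filter_tf_idf(x, .5)
# Print tf-idf summary statistics to help choose min
filter_tf_idf(x, verbose = TRUE)
# The same filtering applied to a TermDocumentMatrix
(y <- with(presidential_debates_2012, q_tdm(dialogue, paste(person, time, sep = "_"))))
filter_tf_idf(y)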

trinker/gofastr documentation built on May 31, 2019, 8:43 p.m.