Remove words from a TermDocumentMatrix or DocumentTermMatrix that do not meet a tf-idf threshold. The code is based on Gruen & Hornik's (2011) code but allows for easier chaining and extends the filtering to a TermDocumentMatrix. This can be used to remove words that appear too frequently in a corpus and therefore carry little information.
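To make the idea concrete, here is a minimal base R sketch of term-level tf-idf filtering on a toy dense count matrix. It follows the general approach described by Gruen & Hornik (2011) as I understand it (mean within-document term frequency multiplied by a log2 inverse document frequency); it is not the package's own implementation. The toy matrix, the variable names, the use of the median as the threshold, and the non-strict `>=` comparison are all assumptions made for illustration.

```r
# Toy document-term count matrix (hypothetical data):
# rows are documents, columns are terms.
dtm <- matrix(
  c(2, 1, 0,
    1, 0, 3,
    1, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("the", "tax", "jobs"))
)

# Term frequency within each document (counts divided by document length).
tf <- dtm / rowSums(dtm)

# Number of documents in which each term occurs.
doc_freq <- colSums(dtm > 0)

# Mean term frequency over the documents that contain the term.
mean_tf <- colSums(tf) / doc_freq

# Inverse document frequency; a term found in every document scores 0.
idf <- log2(nrow(dtm) / doc_freq)

# One tf-idf score per term.
term_tfidf <- mean_tf * idf

# Assumption: use the median score as the threshold and keep terms at or above it.
threshold <- median(term_tfidf)
filtered <- dtm[, term_tfidf >= threshold, drop = FALSE]

print(round(term_tfidf, 4))
print(colnames(filtered))
```

In this toy example "the" occurs in every document, so its inverse document frequency is zero and it is dropped, which is the "appears too frequently" case described above.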
filter_tf_idf(x, min = NULL, verbose = FALSE)
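x | A TermDocumentMatrix or DocumentTermMatrix.
min | A minimal threshold that a word's tf-idf must exceed. If NULL (the default), a threshold is chosen from the data; in the example output below it is 0.00042, which matches the median tf-idf.
verbose | logical. If TRUE, summary statistics for the tf-idf and the value used for min are printed.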
Returns a TermDocumentMatrix or DocumentTermMatrix.
Bettina Grün, Kurt Hornik, and Tyler Rinker <tyler.rinker@gmail.com>.
Bettina Gruen & Kurt Hornik (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30. http://www.jstatsoft.org/article/view/v040i13/v40i13.pdf
(x <- with(presidential_debates_2012, q_dtm(dialogue, paste(person, time, sep = "_"))))
filter_tf_idf(x)
filter_tf_idf(x, .5)
filter_tf_idf(x, verbose = TRUE)
(y <- with(presidential_debates_2012, q_tdm(dialogue, paste(person, time, sep = "_"))))
filter_tf_idf(y)
<<DocumentTermMatrix (documents: 10, terms: 3377)>>
Non-/sparse entries: 8364/25406
Sparsity : 75%
Maximal term length: 16
Weighting : term frequency (tf)
Warning message:
Argument removeNumbers not used.
<<DocumentTermMatrix (documents: 10, terms: 1689)>>
Non-/sparse entries: 4024/12866
Sparsity : 76%
Maximal term length: 16
Weighting : term frequency (tf)
<<DocumentTermMatrix (documents: 10, terms: 0)>>
Non-/sparse entries: 0/0
Sparsity : 100%
Maximal term length: 0
Weighting : term frequency (tf)
Summary stats for the tf-idf:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000000 0.0003729 0.0004152 0.0007531 0.0008198 0.0150768
0.00042 used for `min`
<<DocumentTermMatrix (documents: 10, terms: 1689)>>
Non-/sparse entries: 4024/12866
Sparsity : 76%
Maximal term length: 16
Weighting : term frequency (tf)
<<TermDocumentMatrix (terms: 3377, documents: 10)>>
Non-/sparse entries: 8364/25406
Sparsity : 75%
Maximal term length: 16
Weighting : term frequency (tf)
Warning message:
Argument removeNumbers not used.
<<TermDocumentMatrix (terms: 1689, documents: 10)>>
Non-/sparse entries: 4024/12866
Sparsity : 76%
Maximal term length: 16
Weighting : term frequency (tf)