filter_tf_idf: Remove Words Below a TF-IDF Threshold from a...


Description

Remove words that do not meet a tf-idf threshold from a TermDocumentMatrix or DocumentTermMatrix. The code is based on Gruen & Hornik's (2011) code but allows for easier chaining and extends the filtering to a TermDocumentMatrix. It can be used to remove words that appear so frequently across a corpus that they carry little information.
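
The scoring idea can be sketched in base R. This is a minimal illustration in the style of the per-term tf-idf from Gruen & Hornik (2011), assuming a small dense count matrix instead of tm's sparse classes. The toy data, the variable names, and the use of `>=` for the comparison are assumptions made for the example, not gofastr's actual internals.

```r
# Toy document-term matrix (invented data): rows are documents, columns are terms.
dtm <- matrix(
  c(2, 1, 0, 1,
    1, 0, 3, 1,
    1, 0, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("the", "tax", "jobs", "plan"))
)

# Relative term frequency within each document (count / document length).
rel_tf <- dtm / rowSums(dtm)

# Number of documents that contain each term.
doc_freq <- colSums(dtm > 0)

# Mean relative frequency per term, averaged over the documents containing it.
mean_tf <- colSums(rel_tf) / doc_freq

# Inverse document frequency: terms present in every document get log2(1) = 0.
idf <- log2(nrow(dtm) / doc_freq)

term_tfidf <- mean_tf * idf

# Mirror min = NULL by falling back to the median score (assumed >= comparison).
threshold <- median(term_tfidf)
filtered <- dtm[, term_tfidf >= threshold, drop = FALSE]

print(round(term_tfidf, 3))
print(colnames(filtered))
```

In this toy matrix, "the" and "plan" occur in every document, so their inverse document frequency is zero and they fall below the median threshold, while "tax" and "jobs" are kept.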

Usage

filter_tf_idf(x, min = NULL, verbose = FALSE)

Arguments

x

A TermDocumentMatrix or DocumentTermMatrix.

min

The minimum threshold that a word's tf-idf must exceed. If min = NULL, the median of the tf-idf values is used.

verbose

logical. If TRUE, the summary statistics of the tf-idf values are printed. This can be useful for exploring the scores and choosing a min value.
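
One way to use such summary statistics is to check how many terms would survive a few candidate thresholds before settling on min. The sketch below works on a hypothetical vector of per-term scores; the values, the candidate names, and the `>=` comparison are assumptions for illustration only.

```r
# Hypothetical per-term tf-idf scores, invented for this example.
tfidf <- c(the = 0, plan = 0, tax = 0.0004, jobs = 0.0009, medicare = 0.0150)

# Six-number summary of the scores, as produced by base R's summary().
print(summary(tfidf))

# Candidate thresholds: the median, the third quartile, and a fixed value.
candidates <- c(
  median = median(tfidf),
  q75    = unname(quantile(tfidf, 0.75)),
  half   = 0.5
)

# Count the terms that would be kept at each candidate threshold.
kept <- vapply(candidates, function(th) sum(tfidf >= th), integer(1))
print(kept)
```

A threshold far above the maximum score removes every term, which is what happens with the 0.5 value in the example output, where the largest tf-idf is about 0.015.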

Value

Returns a TermDocumentMatrix or DocumentTermMatrix.

Author(s)

Bettina Grün, Kurt Hornik, and Tyler Rinker <tyler.rinker@gmail.com>.

References

Bettina Gruen & Kurt Hornik (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30. http://www.jstatsoft.org/article/view/v040i13/v40i13.pdf

Examples

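library(gofastr)

# Build a DocumentTermMatrix with one document per person/time combination
(x <- with(presidential_debates_2012, q_dtm(dialogue, paste(person, time, sep = "_"))))
filter_tf_idf(x)                  # min = NULL: the median tf-idf is the threshold
filter_tf_idf(x, .5)              # explicit threshold
filter_tf_idf(x, verbose = TRUE)  # also print the tf-idf summary statistics

# The same filtering applied to a TermDocumentMatrix
(y <- with(presidential_debates_2012, q_tdm(dialogue, paste(person, time, sep = "_"))))
filter_tf_idf(y)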

Example output

<<DocumentTermMatrix (documents: 10, terms: 3377)>>
Non-/sparse entries: 8364/25406
Sparsity           : 75%
Maximal term length: 16
Weighting          : term frequency (tf)
Warning message:
Argument removeNumbers not used. 
<<DocumentTermMatrix (documents: 10, terms: 1689)>>
Non-/sparse entries: 4024/12866
Sparsity           : 76%
Maximal term length: 16
Weighting          : term frequency (tf)
<<DocumentTermMatrix (documents: 10, terms: 0)>>
Non-/sparse entries: 0/0
Sparsity           : 100%
Maximal term length: 0
Weighting          : term frequency (tf)
Summary stats for the tf-idf:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000000 0.0003729 0.0004152 0.0007531 0.0008198 0.0150768 

0.00042 used for `min`

<<DocumentTermMatrix (documents: 10, terms: 1689)>>
Non-/sparse entries: 4024/12866
Sparsity           : 76%
Maximal term length: 16
Weighting          : term frequency (tf)
<<TermDocumentMatrix (terms: 3377, documents: 10)>>
Non-/sparse entries: 8364/25406
Sparsity           : 75%
Maximal term length: 16
Weighting          : term frequency (tf)
Warning message:
Argument removeNumbers not used. 
<<TermDocumentMatrix (terms: 1689, documents: 10)>>
Non-/sparse entries: 4024/12866
Sparsity           : 76%
Maximal term length: 16
Weighting          : term frequency (tf)

gofastr documentation built on May 2, 2019, 5:39 a.m.