dfm_tfidf | R Documentation |
Weight a dfm by term frequency-inverse document frequency (tf-idf), with full control over options. Uses fully sparse methods for efficiency.
dfm_tfidf(
  x,
  scheme_tf = "count",
  scheme_df = "inverse",
  base = 10,
  force = FALSE,
  ...
)
x |
object for which idf or tf-idf will be computed (a document-feature matrix) |
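scheme_tf |
scheme for dfm_weight(); defaults to "count" |
scheme_df |
scheme for docfreq(); defaults to "inverse" |
base |
the base for the logarithms in the dfm_weight() and docfreq() calls; default is 10 |
force |
logical; if TRUE, apply weighting scheme even if the dfm has been weighted before |
... |
additional arguments passed to docfreq() |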
dfm_tfidf computes term frequency-inverse document frequency weighting. The default is to use counts instead of normalized term frequency (the relative term frequency within document), but this can be overridden using scheme_tf = "prop".
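As a rough guide to what the default settings compute, the weight of feature t in document d can be written as below. This is a sketch under the assumption that the "inverse" document frequency scheme takes the unsmoothed form shown. The symbols N (number of documents), df_t (number of documents containing t), and b (the base argument) are notation introduced here for illustration.

```latex
\[
\text{tf-idf}_{t,d} \;=\; \text{tf}_{t,d} \times \log_{b}\!\left(\frac{N}{\text{df}_{t}}\right)
\]
```

With scheme_tf = "count", tf is the raw count of t in d; with scheme_tf = "prop", it is that count divided by the total feature count of d. A feature that occurs in every document has N / df_t = 1 and therefore a weight of zero.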
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
dfmat1 <- as.dfm(data_dfm_lbgexample)
head(dfmat1[, 5:10])
head(dfm_tfidf(dfmat1)[, 5:10])
docfreq(dfmat1)[5:15]
head(dfm_weight(dfmat1)[, 5:10])
# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
dfmat2 <-
    matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
           byrow = TRUE, nrow = 2,
           dimnames = list(docs = c("document1", "document2"),
                           features = c("this", "is", "a", "sample",
                                        "another", "example"))) |>
    as.dfm()
dfmat2
docfreq(dfmat2)
dfm_tfidf(dfmat2, scheme_tf = "prop") |> round(digits = 2)
## Not run:
# comparison with tm
if (requireNamespace("tm")) {
    convert(dfmat2, to = "tm") |> tm::weightTfIdf() |> as.matrix()
    # same as:
    dfm_tfidf(dfmat2, base = 2, scheme_tf = "prop")
}
## End(Not run)
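The worked example above can also be checked by hand. The following is a minimal base R sketch that does not call quanteda at all. It assumes that scheme_tf = "prop" divides each count by its document total and that scheme_df = "inverse" is the unsmoothed log of N / df in the given base. The object names counts, tf, df, idf, and tfidf are illustrative only.

```r
# Hand computation of tf-idf for the two-document example (base R only).
counts <- matrix(c(1, 1, 2, 1, 0, 0,
                   1, 1, 0, 0, 2, 3),
                 byrow = TRUE, nrow = 2,
                 dimnames = list(docs = c("document1", "document2"),
                                 features = c("this", "is", "a", "sample",
                                              "another", "example")))

# Term frequency as a proportion of each document's total count
# (assumed to correspond to scheme_tf = "prop").
tf <- counts / rowSums(counts)

# Document frequency: number of documents in which each feature occurs.
df <- colSums(counts > 0)

# Inverse document frequency, assumed form: log10(N / df), no smoothing.
idf <- log(nrow(counts) / df, base = 10)

# Multiply each feature column by its idf value.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 2)
```

Under these assumptions, "this" and "is" receive a weight of zero because they occur in both documents, and "example" in document2 comes to (3 / 7) * log10(2), or about 0.13, which is the figure to compare against the rounded dfm_tfidf(dfmat2, scheme_tf = "prop") output.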