dtm_remove_tfidf | R Documentation |
Remove terms from a Document-Term-Matrix, and any documents left with no terms, based on the term frequency inverse document frequency (tfidf).
Specify either the maximum number of terms to keep (argument top), a tfidf cutoff (argument cutoff), or a quantile (argument prob).
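As an illustration of the score that drives the filtering, the following base R sketch computes a mean tfidf per term on a toy matrix. It assumes a conventional definition (term frequency relative to document length, natural-log inverse document frequency, averaged over the documents in which the term occurs); the exact formula used by the package may differ, and the toy data and variable names are made up for this example.

```r
# Toy document-term matrix: 3 documents x 4 terms (raw counts, invented data).
dtm <- matrix(c(2, 1, 0, 0,
                1, 0, 3, 0,
                1, 0, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("d1", "d2", "d3"),
                              c("room", "host", "metro", "park")))

# Term frequency: count divided by the number of terms in the document.
tf <- dtm / rowSums(dtm)

# Inverse document frequency (assumed form): log(#documents / #documents containing the term).
idf <- log(nrow(dtm) / colSums(dtm > 0))

# tfidf for every document/term cell.
tfidf <- sweep(tf, 2, idf, `*`)

# Mean tfidf per term, averaged over the documents in which the term occurs (assumption).
mean_tfidf <- colSums(tfidf) / colSums(dtm > 0)

# Terms ordered from most to least distinctive.
sort(mean_tfidf, decreasing = TRUE)
```

In this toy example "room" occurs in every document, so its inverse document frequency and therefore its tfidf is zero, which is the kind of term the function is meant to drop.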
dtm_remove_tfidf(dtm, top, cutoff, prob, remove_emptydocs = TRUE)
dtm |
an object returned by document_term_matrix |
top |
integer with the number of terms which should be kept as defined by the highest mean tfidf |
cutoff |
numeric cutoff value to keep only terms in dtm whose tfidf is above this value |
prob |
numeric quantile indicating to keep only terms in dtm whose tfidf is above the prob quantile of the tfidf values |
remove_emptydocs |
logical indicating whether to remove documents that contain no terms after the term removal is executed. Defaults to TRUE |
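The three selection arguments can be pictured with a small sketch. The helper select_terms below is hypothetical and not part of any package; it assumes a named vector of per-term tfidf scores and uses an inclusive comparison (>=), which may not match the boundary handling of dtm_remove_tfidf itself.

```r
# Hypothetical helper (illustration only): choose which terms to keep
# from a named numeric vector of per-term tfidf scores.
select_terms <- function(tfidf, top, cutoff, prob) {
  if (!missing(top)) {
    # Keep the `top` terms with the highest score.
    keep <- names(head(sort(tfidf, decreasing = TRUE), top))
  } else if (!missing(cutoff)) {
    # Keep terms scoring at least `cutoff` (inclusive comparison is an assumption).
    keep <- names(tfidf)[tfidf >= cutoff]
  } else if (!missing(prob)) {
    # Keep terms scoring at least the `prob` quantile of all scores.
    keep <- names(tfidf)[tfidf >= quantile(tfidf, prob)]
  } else {
    stop("supply one of top, cutoff or prob")
  }
  keep
}

# Invented scores for four terms.
scores <- c(room = 0.0, host = 0.4, metro = 0.8, park = 0.5)

by_top    <- select_terms(scores, top = 2)
by_cutoff <- select_terms(scores, cutoff = 0.45)
by_prob   <- select_terms(scores, prob = 0.5)
```

With these invented scores all three calls retain "metro" and "park": they are the two highest scores, both exceed 0.45, and 0.45 is also the median of the four values.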
a sparse Matrix as returned by sparseMatrix, where terms with high tfidf are kept and documents without any remaining terms are removed
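To make the shape of the result concrete, here is a minimal sketch using the Matrix package directly. It mimics the effect of dropping terms and then dropping empty documents; it is not the package's implementation, and the toy matrix and the choice of kept terms are invented for the example.

```r
library(Matrix)

# Toy sparse document-term matrix: 3 documents x 4 terms (invented data).
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3),
  j = c(1, 2, 1, 3, 1, 4),
  x = c(2, 1, 1, 3, 1, 1),
  dims = c(3, 4),
  dimnames = list(c("d1", "d2", "d3"),
                  c("room", "host", "metro", "park"))
)

# Suppose the tfidf filter decided to keep only these two terms.
keep <- c("metro", "park")
reduced <- dtm[, keep, drop = FALSE]

# Analogue of remove_emptydocs = TRUE: drop documents with no remaining terms.
nonempty <- reduced[Matrix::rowSums(reduced) > 0, , drop = FALSE]

dim(reduced)   # all 3 documents retained, 2 terms
dim(nonempty)  # document d1 dropped because it has neither kept term
```

Keeping the empty rows (the remove_emptydocs = FALSE case) can be useful when the row order must stay aligned with another table of document metadata.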
library(udpipe)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 10)
dim(dtm)

## Keep only terms with high tfidf
x <- dtm_remove_tfidf(dtm, top = 50)
dim(x)
x <- dtm_remove_tfidf(dtm, top = 50, remove_emptydocs = FALSE)
dim(x)

## Keep only terms with tfidf above 1.1
x <- dtm_remove_tfidf(dtm, cutoff = 1.1)
dim(x)

## Keep only terms with tfidf above the 60 percent quantile
x <- dtm_remove_tfidf(dtm, prob = 0.6)
dim(x)