dtm_remove_lowfreq: Remove terms occurring with low frequency from a...

View source: R/nlp_flow.R

dtm_remove_lowfreq R Documentation

Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms

Description

Remove terms which occur with low frequency from a Document-Term-Matrix and, optionally, the documents which are left without any terms after that removal.

Usage

dtm_remove_lowfreq(dtm, minfreq = 5, maxterms, remove_emptydocs = TRUE)

Arguments

dtm

an object returned by document_term_matrix

minfreq

integer with the minimum number of times a term should occur in order to be kept. Defaults to 5.

maxterms

integer indicating the maximum number of terms which should be kept in the dtm. The argument is optional.

remove_emptydocs

logical indicating whether to remove documents which no longer contain any terms after the term removal is executed. Defaults to TRUE.

Value

a sparse Matrix as returned by sparseMatrix where terms with low occurrence are removed and, if remove_emptydocs is TRUE, documents without any terms are also removed
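
The effect of the three arguments can be approximated by hand on a small sparse matrix. The following is a minimal sketch, not the package's implementation. It assumes that the frequency of a term is its column total over all documents, and that maxterms keeps the most frequent terms, which this page does not state. The toy matrix and the variable names (totals, top, kept) are made up for illustration.

```r
library(Matrix)

# Hypothetical toy document-term matrix: 3 documents x 4 terms
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3),
  j = c(1, 2, 1, 3, 4),
  x = c(4, 1, 3, 2, 1),
  dims = c(3, 4),
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("hotel", "room", "city", "bus"))
)

# minfreq: keep terms whose total count is at least minfreq
# (assumption: term frequency is the column sum)
minfreq <- 2
totals <- Matrix::colSums(dtm)
kept <- dtm[, totals >= minfreq, drop = FALSE]

# maxterms: keep at most maxterms terms
# (assumption: the most frequent terms are the ones retained)
maxterms <- 1
top <- names(head(sort(Matrix::colSums(kept), decreasing = TRUE), maxterms))
kept <- kept[, colnames(kept) %in% top, drop = FALSE]

# remove_emptydocs: drop documents left without any terms
kept <- kept[Matrix::rowSums(kept) > 0, , drop = FALSE]
dim(kept)
```

In this sketch, a document whose only terms fall below the threshold ends up as an all-zero row, which is why the empty-document step runs last.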

Examples

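library(udpipe)
data(brussels_reviews_anno)
## Keep only the tokens tagged as xpos "NN" and count each lemma per document
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
## Build the sparse document-term matrix
dtm <- document_term_matrix(x)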


## Remove terms with low frequencies and documents with no terms
x <- dtm_remove_lowfreq(dtm, minfreq = 10)
dim(x)
x <- dtm_remove_lowfreq(dtm, minfreq = 10, maxterms = 25)
dim(x)
x <- dtm_remove_lowfreq(dtm, minfreq = 10, maxterms = 25, remove_emptydocs = FALSE)
dim(x)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.