dtm_remove_sparseterms: Remove terms with high sparsity from a Document-Term-Matrix

View source: R/nlp_flow.R

dtm_remove_sparseterms    R Documentation

Remove terms with high sparsity from a Document-Term-Matrix

Description

Remove terms with high sparsity from a Document-Term-Matrix and, optionally, remove documents that are left with no terms.
The sparsity of a term is the proportion of documents in which the term does not occur.
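
As an illustration of that definition, the following sketch computes per-term sparsity by hand on a toy matrix. It uses only the Matrix package; the data and the names doc_freq and term_sparsity are made up for this example and are not part of udpipe.

```r
library(Matrix)

# Toy document-term matrix: 4 documents (rows) x 3 terms (columns).
# Hypothetical data, for illustration only.
dtm <- sparseMatrix(
  i = c(1, 2, 3, 4, 1, 2, 1),
  j = c(1, 1, 1, 1, 2, 2, 3),
  x = c(2, 1, 1, 3, 1, 1, 5),
  dims = c(4, 3),
  dimnames = list(paste0("doc", 1:4), c("common", "medium", "rare"))
)

# Number of documents in which each term occurs at least once
doc_freq <- colSums(dtm > 0)

# Sparsity: proportion of documents in which the term does not occur
term_sparsity <- 1 - doc_freq / nrow(dtm)
term_sparsity
```

A term that occurs in every document has sparsity 0, and a term that occurs in a single document out of four has sparsity 0.75.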

Usage

dtm_remove_sparseterms(dtm, sparsity = 0.99, remove_emptydocs = TRUE)

Arguments

dtm

an object returned by document_term_matrix

sparsity

numeric in the range 0-1 giving the sparsity threshold as a proportion (not a percentage). Defaults to 0.99, meaning that terms which occur in less than 1 percent of the documents are dropped.

remove_emptydocs

logical indicating whether to remove documents that contain no terms after the sparse terms have been removed. Defaults to TRUE.

Value

a sparse Matrix as returned by sparseMatrix in which terms with high sparsity are removed and, if remove_emptydocs is TRUE, documents without any remaining terms are also removed
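
The effect on the dimensions of the result can be sketched by hand with the Matrix package alone. This is a rough approximation under stated assumptions, not the package's implementation: the strict comparison against the threshold is an assumption, and the toy data and variable names are hypothetical.

```r
library(Matrix)

# Toy matrix: term "a" occurs in docs 1-3, "b" only in doc 1, "c" only in doc 4.
# Hypothetical data, for illustration only.
dtm <- sparseMatrix(
  i = c(1, 2, 3, 1, 4),
  j = c(1, 1, 1, 2, 3),
  x = rep(1, 5),
  dims = c(4, 3),
  dimnames = list(paste0("doc", 1:4), c("a", "b", "c"))
)

threshold <- 0.5

# Proportion of documents in which each term does not occur
term_sparsity <- 1 - colSums(dtm > 0) / nrow(dtm)

# Assumption: keep terms whose sparsity is strictly below the threshold;
# the exact comparison used inside the package may differ.
reduced <- dtm[, term_sparsity < threshold, drop = FALSE]

# Roughly what remove_emptydocs = TRUE describes: drop documents left without terms
reduced_nonempty <- reduced[rowSums(reduced) > 0, , drop = FALSE]

dim(reduced)           # all documents kept, one term left
dim(reduced_nonempty)  # doc4 dropped because its only term was removed
```

Keeping the empty documents can be useful when the rows of the result must stay aligned with another per-document data structure.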

Examples

data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)


## Remove terms with low frequencies and documents with no terms
x <- dtm_remove_sparseterms(dtm, sparsity = 0.99)
dim(x)
## Same term removal, but keep documents that are left with no terms
x <- dtm_remove_sparseterms(dtm, sparsity = 0.99, remove_emptydocs = FALSE)
dim(x)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.